Parameters defined via general estimating equations (GEE) can 
be estimated by maximizing the empirical likelihood (EL). Newey and 
Smith [Econometrica 72 (2004) 219-255] have recently shown that 
this EL estimator exhibits desirable higher-order asymptotic 
properties, namely, that its $O(n^{-1})$ bias is small and that bias-corrected 
EL is higher-order efficient. Although EL possesses these properties 
when the model is correctly specified, this paper shows that, in the 
presence of model misspecification, EL may cease to be root n con- 
vergent when the functions defining the moment conditions are un- 
bounded (even when their expectations are bounded). In contrast, 
the related exponential tilting (ET) estimator avoids this problem. 
This paper shows that the ET and EL estimators can be naturally 
combined to yield an estimator called exponentially tilted empirical 
likelihood (ETEL) exhibiting the same $O(n^{-1})$ bias and the same 
$O(n^{-2})$ variance as EL, while maintaining root n convergence under 
model misspecification. 

1. Introduction. Statistical models defined via general estimating 
equations (GEE) of the form $E[g(x, \theta)] = 0$, where $g(x, \theta)$ is a vector-valued 
nonlinear function of a random vector $x$ and a parameter vector $\theta$, are 
very common in statistics. In such models, the parameter vector $\theta$ is 
traditionally estimated using two-step efficient generalized method of moments 
estimators (GMM) [21]. Over the last two decades, various one-step alter- 
natives to two-step GMM have been suggested. Perhaps the best known 
estimators of this class are the empirical likelihood (EL), exponential tilt- 
ing (ET) and GMM with continuous updating (CU) estimators, which have 
been previously studied in the econometrics [22, 26, 27, 35, 47] and statistics 
[37, 45, 48, 49, 50, 53] literatures. While all of these alternative estimators 
of $\theta$ share the first-order efficiency of efficient two-step GMM, their one-step 
nature provides them with desirable properties not enjoyed by GMM. In 
addition to bypassing the arbitrariness in the choice of first-step estimate 
(since any consistent estimate of $\theta$ can, in principle, be used as a first step 
and lead to slightly different second-step estimates in finite samples), these 
one-step estimators are also invariant under general parameter-dependent 
linear transformations of the vector of moment conditions [30, 50] and pos- 
sess superior higher-order asymptotic properties [27, 28, 29, 47]. 

Considerable effort has been devoted to identifying which of these alter- 
native estimators, EL, ET or CU, is preferable. Since all of these estimators 
are asymptotically equivalent up to $O_p(n^{-1/2})$ when the overidentifying 
restrictions are valid, differences must reside in their higher-order asymptotic 
properties or in their behavior under potential model misspecification. The 
CU estimator is generally regarded as less desirable than EL and ET because 
its objective function has often been observed to possess multiple modes 
[22, 30] and because it lacks the ability to generate likelihood ratio-based 
confidence regions whose shape adapts to the support of the data [4, 50]. 
Comparing ET and EL proves to be more difficult. On the one hand, based 
on a stochastic expansion argument, Newey and Smith [47] have established 
that EL should typically have a lower finite-sample bias than both ET and 
CU. Also, they have shown that bias-corrected EL is higher-order efficient 
relative to any other bias-corrected regular method of moments estimator. On the other hand, 
Imbens and co-workers [27, 30] have indicated that EL, unlike ET, exhibits 
a singularity in its influence function, suggesting that ET should be better 
behaved than EL in the presence of model misspecification. In addition, ET 
admits a computationally convenient treatment of misspecified models [32]. 

Although it can be argued that model misspecification can always be 
avoided through the use of specification tests, an alternative view is that 
most models are only approximations to the underlying phenomena and are 
therefore intrinsically misspecified. Accordingly, there exists a growing liter- 
ature devoted to the study of so-called globally misspecified models (in which 
the misspecification does not vanish asymptotically). The classic theory of 
maximum likelihood estimators (MLE) when the distributional assumptions 
are misspecified can be found in [1, 25, 63, 64]. In this context, MLE consis- 
tently estimates the so-called pseudo-true value of the parameter of interest 
[56], which is defined as the parameter value associated with the distribu- 
tion which is the closest to the true data generating process according to 
the so-called Kullback-Leibler information criterion (KLIC) discrepancy. 

In recent years, the analysis of misspecified models has been actively ex- 
tended to various extremum estimators [2, 13, 44, 51] and, in particular, 
to overidentified moment condition models [8, 18, 26, 32, 34, 41]. Overi- 
dentified models arise naturally in a number of applications. For instance, 
consider a regression model $y = x'\theta + \varepsilon$ where $\varepsilon$ is correlated with $x$ (so that 
least squares cannot be used) but uncorrelated with a vector of so-called 
instruments (denoted $z$). This leads to a vector of restrictions of the form 
$E[(y - x'\theta)z] = 0$, the dimension of which typically exceeds the dimension of 
$\theta$. Given the overidentified (i.e., overdetermined) nature of the restrictions, 
it is then possible that no value of $\theta$ simultaneously satisfies all the moment 
restrictions exactly in the population, resulting in a misspecified model [41]. 
A more extensive discussion of misspecified models as well as many refer- 
ences to empirical studies that perform inference with models which fail 
standard specification tests can be found in [18]. 
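
As an illustration of the instrumental variables example above, the following minimal sketch evaluates the sample analogue of $E[(y - x'\theta)z]$. The function name `iv_sample_moments` and the toy data are hypothetical and chosen only for illustration; with one regressor and two instruments there are two moment restrictions for a single parameter, so a value of $\theta$ that sets both to zero need not exist.

```python
def iv_sample_moments(y, x, z, theta):
    """Sample analogue of E[(y - x'theta) z], one entry per instrument."""
    n = len(y)
    # Residuals y_i - x_i'theta for each observation.
    resid = [yi - sum(xj * tj for xj, tj in zip(xi, theta))
             for yi, xi in zip(y, x)]
    n_instruments = len(z[0])
    return [sum(r * zi[k] for r, zi in zip(resid, z)) / n
            for k in range(n_instruments)]

# Toy data (hypothetical): one regressor, two instruments.
y = [1.0, 2.0, 3.0]
x = [[1.0], [2.0], [3.0]]
z = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(iv_sample_moments(y, x, z, [0.0]))
```

In this toy data set $y = x$ exactly, so both moments vanish at $\theta = 1$; in general the two sample moments cannot be set to zero simultaneously, which is what makes the model overidentified.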

The motivation behind this interest in misspecified models stems from 
two observations. First, the imperfections of a model, although statistically 
detectable, may nevertheless be small in absolute terms and consequently 
have little impact on the results ([42], pages 1168-1169). Second, a misspec- 
ified but parsimoniously parametrized model may have better predictive 
power than a more realistic complex model which passes all specification 
tests ([9], page 596). At fixed sample size, as the number of parameters 
increases, their variances tend to increase as well, while the power of overi- 
dentification tests tends to decrease. 

This paper is organized as follows. After briefly reviewing the properties 
of the EL, ET and CU estimators, we present a simple result that character- 
izes EL's poor behavior under misspecification in order to motivate the need 
for a new estimator. We then introduce an estimator called exponentially 
tilted empirical likelihood (ETEL) that naturally combines EL and ET, ex- 
tending an approach previously considered in [10, 31, 40] for constructing 
likelihood-ratio confidence regions for the mean to the case of point esti- 
mation of parameters defined via general moment restrictions. The ETEL 
estimator is shown to be well behaved under model misspecification, like 
ET, while preserving the desirable higher-order asymptotic properties of EL 
established in [47]. Finally, simulations are used to illustrate the usefulness 
of this estimator. All proofs can be found in the Appendix. 

2. Existing one-step alternatives to GMM. 

2.1. Generalities. We first introduce our notation. 

Definition 1. Let $\theta$ denote the parameter vector of interest belonging 
to a compact subset $\Theta$ of $\mathbb{R}^{N_\theta}$. Let $x_i$ be a sequence of random vectors taking 
values in $\mathcal{X} \subset \mathbb{R}^{N_x}$. Let $g(x_i,\theta)$ denote a vector of moment functions taking 
values in $\mathbb{R}^{N_g}$ and satisfying $E[g(x_i,\theta^*)] = 0$ at $\theta^* \in \Theta$. Let $n$ denote sample 
size and let all summations be over $1, \ldots, n$. Let $\|\cdot\|$ denote any convenient 
vector or matrix norm. 

Table 1 
The EL, ET and CU estimators as particular cases of MED and GEL estimators (adapted from [47]) 

Estimator   $\gamma$   $h(w)$                                          $\rho(\xi)$
EL          $-1$       $-\ln(nw)$                                      $\ln(1 - \xi)$
ET          $0$        $nw \ln(nw)$                                    $-\exp(\xi)$
CU          $1$        $(nw)^2$                                        $-(1 + \xi)^2/2$
ECR         $\gamma$   $((nw)^{\gamma+1} - 1)/(\gamma(\gamma+1))$      $-(1 + \gamma\xi)^{(\gamma+1)/\gamma}/(\gamma+1)$

The simplest way to summarize the properties of the EL, ET and CU 
estimators is to embed them in more general families of estimators. All 
three estimators admit two convenient representations. They can first be 
interpreted as minimum empirical discrepancy (MED) estimators [10, 11], 

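(1)  $\hat\theta = \arg\min_{\theta} n^{-1} \sum_i h(\hat w_i(\theta))$, 

where $\hat w_i(\theta)$ is the solution to 

(2)  $\min_{\{w_i\}_{i=1}^{n}} n^{-1} \sum_i h(w_i)$ 

subject to moment and normalization constraints, 

(3)  $\sum_i w_i g(x_i,\theta) = 0$ and $\sum_i w_i = 1$. 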

(The term empirical discrepancy is used here to emphasize the fact that it 
is a discrepancy between measures supported on the sample rather than on 
a fixed discrete support.) Different choices of the discrepancy measure $h(\cdot)$ 
yield distinct estimators, as given in Table 1. Specific choices of $h(\cdot)$ have 
historically been given special names. The discrepancy used in EL, $h(w) = -\ln(nw)$, 
is known as the Kullback-Leibler information criterion (KLIC). 
Also, rewriting the minimization problem as an equivalent maximization 
problem, EL can be thought of as maximizing the "likelihood." In a similar 
fashion, ET, with $h(w) = nw\ln(nw)$, can be interpreted as maximizing a 
quantity known as entropy. 

The minimum discrepancy formulation emphasizes that the estimator 
seeks to "reweight" the sample in order to satisfy the moment conditions 
exactly. The function $h(w_i)$ quantifies the amount of reweighting taking place 
and penalizes values that differ from $w_i = n^{-1}$. The point estimate $\hat\theta$ is the 
value that minimizes the discrepancy between $\hat w_i(\theta)$ and uniform weights. 
The weights $\hat w_i(\hat\theta)$ are sometimes called implied probabilities because they 
can be used to construct more efficient empirical estimates of the data 
generating process [3, 26, 53]. 
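
To make the reweighting concrete, the following worked derivation sketches how the inner problem (2)-(3) is solved in the ET case, $h(w) = nw\ln(nw)$. The multipliers $\lambda$ and $\mu$ are introduced here for illustration, and the sign convention for $\lambda$ is an assumption chosen so that the weights are proportional to $\exp(\lambda' g)$.

```latex
% Lagrangian of (2)-(3) with h(w) = nw \ln(nw), so that n^{-1}\sum_i h(w_i) = \sum_i w_i \ln(n w_i)
\mathcal{L} = \sum_i w_i \ln(n w_i)
            - \lambda' \sum_i w_i g(x_i,\theta)
            - \mu \Big( \sum_i w_i - 1 \Big)

% Stationarity with respect to w_i
\ln(n w_i) + 1 - \lambda' g(x_i,\theta) - \mu = 0

% Imposing \sum_i w_i = 1 eliminates \mu
w_i = \frac{\exp(\lambda' g(x_i,\theta))}{\sum_j \exp(\lambda' g(x_j,\theta))}

% The moment constraint \sum_i w_i g(x_i,\theta) = 0 then pins down \lambda
\sum_i \exp(\lambda' g(x_i,\theta))\, g(x_i,\theta) = 0
```

The resulting weights are positive for any finite $\lambda$, and the $n$-dimensional search over the weights collapses to a search over the $N_g$-dimensional multiplier $\lambda$.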

EL, ET and CU can also be characterized as particular cases of the so-called 
generalized empirical likelihood (GEL) family of estimators [61], 

(4)  $\hat\theta = \arg\min_{\theta} n^{-1} \sum_i \rho(\hat\lambda(\theta)' g(x_i,\theta))$, 

where the $N_g$-dimensional vector $\hat\lambda(\theta)$ is given by 

(5)  $\hat\lambda(\theta) = \arg\max_{\lambda} n^{-1} \sum_i \rho(\lambda' g(x_i,\theta))$. 

The choice of the function $\rho(\cdot)$ defines the estimator used, as described in 
Table 1. The advantage of the GEL formulation is the computational 
convenience of solving an $(N_g + N_\theta)$-dimensional optimization problem rather 
than the $(n + N_\theta)$-dimensional problem defining an MED estimator. 
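
The computational convenience of the dual form can be illustrated with a minimal sketch of the inner problem (5) for ET, where $\rho(\xi) = -\exp(\xi)$, so that maximizing $n^{-1}\sum_i \rho(\lambda g_i)$ amounts to minimizing the convex function $n^{-1}\sum_i \exp(\lambda g_i)$. The sketch assumes a scalar moment function; the function name `et_lambda`, the damped Newton iteration and the toy data are illustrative choices, not the paper's implementation.

```python
import math

def et_lambda(g, tol=1e-12, max_iter=100):
    """Scalar ET multiplier: minimize M(lam) = mean(exp(lam * g_i)) by damped Newton."""
    if not (min(g) < 0.0 < max(g)):
        raise ValueError("0 must lie strictly inside the range of g")
    n = len(g)
    M = lambda l: sum(math.exp(l * gi) for gi in g) / n
    lam = 0.0
    for _ in range(max_iter):
        e = [math.exp(lam * gi) for gi in g]
        grad = sum(ei * gi for ei, gi in zip(e, g)) / n
        hess = sum(ei * gi * gi for ei, gi in zip(e, g)) / n
        if abs(grad) < tol:
            break
        step = -grad / hess
        # Backtrack until the convex objective does not increase.
        while M(lam + step) > M(lam):
            step *= 0.5
        lam += step
    return lam

g = [-1.0, 0.5, 2.0]          # toy values of g(x_i, theta) at a fixed theta
lam_hat = et_lambda(g)
```

At the solution the first-order condition $\sum_i \exp(\hat\lambda g_i) g_i = 0$ holds, which is the scalar version of the moment constraint in (3) applied to exponentially tilted weights.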

As pointed out in [47], only specific choices of $h(\cdot)$ lead to an estimator 
admitting an equivalent GEL representation. A particularly rich class of 
such discrepancies is given by the Cressie-Read (CR) discrepancies [11], 

(6)  $h(w_i) = \frac{(n w_i)^{\gamma+1} - 1}{\gamma(\gamma+1)}$, 

where $\gamma$ is the parameter indexing the family. The corresponding $\rho(\cdot)$ is 
given in Table 1. The GEL representation of an MED estimator is called a 
dual problem because it amounts to reformulating the optimization in terms 
of the Lagrange multiplier $\lambda$ of the moment constraints. Newey and Smith 
[47] conjecture that Cressie-Read discrepancies may be the only discrepancies 
admitting a GEL representation. The weights attributed to the sample 
points in the original MED estimator can be recovered from 

(7)  $\hat w_i(\theta) = \frac{\tau(\hat\lambda(\theta)' g(x_i,\theta))}{\sum_j \tau(\hat\lambda(\theta)' g(x_j,\theta))}$, 

where $\tau(\xi) = d\rho(\xi)/d\xi$. We will refer to $\tau(\xi)$ as the tilting function because, 
as seen in equation (7), it indicates how the sample points are reweighted. 
The EL, ET and CU estimators are all members of this class (see Table 1) of 
empirical Cressie-Read (ECR) estimators. For a more detailed description of 
these families of estimators, we refer the reader to the excellent discussions 
found in [47, 50]. 
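
Equation (7) can be sketched in a few lines for a scalar moment function. The tilting functions below are normalized so that $\tau(0) = 1$ (overall sign and scale cancel in the ratio); the EL form $1/(1 - \xi)$ and the exponential form for ET follow the text, while the CU form $1 + \xi$ is assumed here as the quadratic case. The names `TAU` and `implied_probabilities` and the toy data are hypothetical.

```python
import math

# Tilting functions tau(xi), normalized so that tau(0) = 1.
TAU = {
    "EL": lambda xi: 1.0 / (1.0 - xi),   # singular at xi = 1
    "ET": lambda xi: math.exp(xi),       # positive for every xi
    "CU": lambda xi: 1.0 + xi,           # negative when xi < -1 (assumed form)
}

def implied_probabilities(g, lam, estimator):
    """Equation (7) for scalar g_i: weights proportional to tau(lam * g_i)."""
    tau = TAU[estimator]
    raw = [tau(lam * gi) for gi in g]
    total = sum(raw)
    return [r / total for r in raw]

g = [-3.0, 0.5, 1.0, 2.0]     # toy values of g(x_i, theta)
for name in ("EL", "ET", "CU"):
    print(name, implied_probabilities(g, 0.4, name))
```

With these toy values the CU weight on the first observation is negative, whereas the ET weights are positive whatever the value of the multiplier. The value 0.4 is an arbitrary multiplier used for illustration, not the solution of (5).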

2.2. Comparing the ECR estimators. Let us first give the properties 
shared by all ECR estimators. For just-identified models ($N_g = N_\theta$), all of 
these estimators are trivially identical because the moment conditions can 






S. M. SCHENNACH 



be satisfied exactly simply by choosing $\theta$ appropriately without the need for 
tilting ($\hat\lambda(\theta) = 0$). In over-identified models for which the over-identifying 
restrictions are valid, all ECR (and GEL) estimators possess the same asymptotic 
variance [47], which is equal to the asymptotic variance of the two-step 
efficient GMM estimator. All ECR estimators also enable the construction 
of confidence regions for the mean ($g(x_i,\theta) = x_i - \theta$) through convenient 
$\chi^2$-calibrated likelihood-ratio tests [4]. In light of the results in [53], Baggerly's 
results should extend to general $g(x_i,\theta)$. 

The similarities end at the level of first-order asymptotic properties in 
correctly specified models, however. As noted in [4], the behavior of the implied 
probabilities $\hat w_i(\theta)$ in finite samples differs markedly as a function of the sign 
of the parameter $\gamma$. For ECR with $\gamma \le 0$, the implied probabilities $\hat w_i(\theta)$ are 
positive by construction, while for $\gamma > 0$, they can take on negative values. 
In a correctly specified model (where the implied probabilities converge to 
$n^{-1}$ for all ECR), negative weights become decreasingly likely as sample size 
grows and it is possible to entirely avoid negative weights via the use of a 
"shrinkage factor" correction (see [6]) that vanishes asymptotically and that 
has no effect on the limiting distribution. Nevertheless, under misspecification, 
the "shrinkage factor" correction does not vanish asymptotically since 
negative weights remain likely even asymptotically when $\gamma > 0$. 

Positive implied probabilities are associated with likelihood-ratio confi- 
dence regions whose shape better adapts to the data [4, 50]. For instance, 
confidence regions for the mean then always lie within the convex hull of the 
support of the density of the corresponding random variable. Positive im- 
plied probabilities are also important in applications that require empirical 
estimates of the data generating process, as in the bootstrap, for instance 
[7]. These observations indicate that EL (with $\gamma = -1$) and ET (with $\gamma = 0$) 
should be preferable to CU (with $\gamma = 1$). CU also suffers from a different 
problem, namely the potential presence of multiple local maxima in its ob- 
jective function [22, 30]. 

Numerous authors have sought to further narrow down the choice of de- 
sirable ECR estimators. EL is often singled out among the ECR because 
it leads to likelihood ratio tests that are often, though not always, Bartlett 
correctable [10, 14, 39]. Newey and Smith [47] have recently shown that EL 
generally exhibits a smaller $O(n^{-1})$ bias than any other member of the ECR 
family [unless the centered third moments of the distribution of $g(x_i,\theta^*)$ 
happen to all vanish, in which case all ECR estimators have the same $O(n^{-1})$ 
bias]. They have also shown that bias-corrected EL is higher-order efficient, 
possessing an $O(n^{-2})$ variance that is no greater than that of any other 
bias-corrected regular method of moments estimator. 

2.3. Behavior under misspecification. As mentioned in the Introduction, 
in the presence of misspecification, the object of interest is the pseudo-true 
value of the parameter vector. In the case of MED estimators, the pseudo-true 
value is defined as the value of $\theta$ which minimizes the population version 
of the empirical discrepancy used in the estimation procedure. 

It is important to note that although two different estimators may consis- 
tently estimate the truth in a correctly specified model, they may converge 
in probability to different pseudo-true values in the presence of misspeci- 
fication. These two pseudo-true values merely represent the minimizers of 
two different well-defined discrepancies. Even though it could be argued 
that pseudo-true values are generally "biased," the literature on estimation 
under model misspecification considers estimators of pseudo-true values as 
valid statistics for the purpose of inference (see [56], as an early reference). 
Following the recent literature using various ECR estimators under model 
misspecification, we will not argue whether any ECR has a "better" pseudo- 
true value than another in a given context. Instead, we will compare the 
convergence of various ECR estimators toward their respective pseudo-true 
values — a property that will be relevant regardless of the context of interest. 

Imbens, Spady and Johnson [30] have informally argued that EL may be 
ill-behaved under model misspecification due to the fact that its influence 
function [20] is proportional to 

     $\frac{1}{1 - \lambda' g(x_i,\theta^*)}\,\frac{\partial g'(x_i,\theta^*)}{\partial\theta}$, 

where the denominator $(1 - \lambda' g(x_i,\theta^*))$ can approach zero. We formalize 
this concern by showing that EL suffers from a dramatic degradation of its 
asymptotic properties under even the slightest amount of misspecification. 

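Theorem 1. Let $x_i$ be an i.i.d. sequence and assume $g(x,\theta)$ is twice 
continuously differentiable in $\theta$ for all $x$ and all $\theta \in \Theta$ and such that 
$\sup_{\theta\in\Theta} E[\|g(x_i,\theta)\|^2] < \infty$. If $\inf_{\theta\in\Theta} \|E[g(x_i,\theta)]\| \neq 0$ and 
$\sup_{x\in\mathcal{X}} u'g(x,\theta) = \infty$ for any $\theta \in \Theta$ and any unit vector $u$, then there 
exists no $\theta^*_{EL} \in \Theta$ such that $\|\hat\theta_{EL} - \theta^*_{EL}\| = O_p(n^{-1/2})$. 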

This theorem can be extended to the case where the moment function 
$g(x_i,\theta)$ diverges only along some directions $u$ but not others. In that case, 
the lack of root n consistency is avoided only when $E[g(x_i,\theta^*)]$ happens to 
be orthogonal to the hyperplane along which $g(x_i,\theta)$ diverges. 

While Theorem 1 does not prevent $\hat\theta_{EL}$ from being a consistent estimator 
of the pseudo-true value $\theta^*_{EL}$, it does preclude $\hat\theta_{EL}$ from being root n 
consistent, under the assumptions of the theorem. The proof of this result, 
which can be found in the Appendix, is somewhat involved, because standard 
asymptotics break down for EL under misspecification with unbounded 
$g(x,\theta)$. The following heuristic argument illustrates the nature of the 
problem: First note that the EL implied probabilities are given by 

(8)  $\hat w_i = n^{-1}(1 - \hat\lambda' g(x_i,\theta))^{-1}$ 

and must be positive [49]. This implies that $\hat\lambda \to 0$, for otherwise, 
$\max_{i \le n} \hat\lambda' g(x_i,\theta^*)$ would become unbounded as $n \to \infty$, causing some $\hat w_i$ to become 
negative. Now, the population version of the first-order condition for $\lambda$ is 
$E[g(x_i,\theta^*)/(1 - \lambda^{*\prime} g(x_i,\theta^*))] = 0$, where $\lambda^*$ and $\theta^*$ denote pseudo-true 
values. Yet, at the pseudo-true value $\lambda^* = \operatorname{plim} \hat\lambda = 0$, this expectation takes the 
value $E[g(x_i,\theta^*)]$, which is not zero, by the assumption of misspecification. 
Hence, the asymptotics of EL cannot be determined from a standard 
expansion of the first-order conditions around the pseudo-true values that satisfy 
the first-order conditions in the population. The limit as $n \to \infty$ and as 
$\lambda \to 0$ cannot be freely exchanged, indicating that the moments entering the 
first-order conditions violate the standard dominance regularity conditions 
used to establish the asymptotics of M-estimators [46]. 
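
The positivity argument in the heuristic above can be made concrete for a scalar moment function: requiring $1 - \lambda g_i > 0$ for every observation confines $\lambda$ to an open interval whose endpoints are the reciprocals of the sample extremes. The helper `el_lambda_interval` below is a hypothetical illustration of that constraint, not part of the proof.

```python
def el_lambda_interval(g):
    """Open interval of scalar lambda with 1 - lambda * g_i > 0 for all i."""
    g_min, g_max = min(g), max(g)
    if not (g_min < 0.0 < g_max):
        raise ValueError("need observations on both sides of zero")
    # g_i > 0 requires lambda < 1/g_i; g_i < 0 requires lambda > 1/g_i.
    return 1.0 / g_min, 1.0 / g_max

print(el_lambda_interval([-2.0, 1.0, 4.0]))
print(el_lambda_interval([-2.0, 1.0, 4.0, 40.0]))   # one extreme draw added
```

Adding a single extreme observation shrinks the admissible interval toward zero. When the support of $g$ is unbounded, the sample extremes grow with $n$, so the interval collapses onto $\lambda = 0$, which is the mechanism behind $\hat\lambda \to 0$.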

Theorem 1 indicates that, unless one is willing to solely use moment 
functions that take values in a compact set [so that $\sup_{x\in\mathcal{X}} u'g(x,\theta^*)$ is bounded 
for any $u$], the slightest amount of misspecification can cause the first-order 
asymptotic properties of EL to degrade catastrophically. It is important to 
realize that it is very common that the function $g(x,\theta)$ itself is unbounded 
even when $E[g(x,\theta)]$ is finite. For instance, if $g(x,\theta) = (x_1 - \theta, x_2 - 1)'$ and 
$x = (x_1, x_2)$ is drawn from a bivariate normal, $g(x,\theta)$ is unbounded even 
though $E[g(x,\theta)]$ exists. 

Of course, when the main hypothesis of Theorem 1 does not hold (i.e., when 
$\sup_{x\in\mathcal{X}} \|g(x,\theta)\| < \infty$), root n consistency becomes possible. For instance, the 
type of moment conditions advocated in the robustness literature (e.g., 
[20, 24]) involves bounded functions, and root n consistent estimation under 
misspecification is therefore possible using EL. Nevertheless, Theorem 1 does 
rule out moment conditions such as a simple average of random variables 
drawn from a distribution with unbounded support. 

Theorem 1 is especially important given the growing literature on min- 
imum empirical discrepancy estimators in misspecified models [8, 23, 26, 
32, 34, 54, 61]. In the nonnested model selection literature using minimum 
discrepancies, it is often assumed that the competing models may be all mis- 
specified and one is merely concerned with choosing the least misspecified 
model (e.g., [8, 32, 34]). Since the model that is eventually used for inference 
may then be misspecified, Theorem 1 is particularly relevant in this context 
and indicates that EL may not be well suited to these applications — unless 
the assumption of bounded $g(x_i,\theta)$ is made, which is precisely the assumption 
that the model selection literature using EL has so far relied upon. 

EL's implied probability weights also exhibit questionable behavior under 
misspecification with unbounded $g(x_i,\theta)$. Since the EL implied probabilities 
$\hat w_i = n^{-1}(1 - \hat\lambda' g(x_i,\theta))^{-1}$ must be positive [4], it is straightforward 
to see that $\hat\lambda \to 0$ when $g(x_i,\theta)$ is unbounded. Then note that the implied 
probabilities associated with all points $x_i$ such that $g(x_i,\theta) \in C$ for a given 
compact set $C$ converge to $n^{-1}$ uniformly. Since this result holds for any 
compact set, this shows that, as sample size grows, all the adjustments to 
the implied probabilities become concentrated on the extreme observations. 
This would be desirable if the weights of these extreme observations were 
always decreased to ensure that the moment conditions are satisfied, but 
this is not the case. In fact, due to the convexity of EL's tilting function 
$\tau(\xi) = 1/(1 - \xi)$, the reweighting of the sample in order to satisfy the 
misspecified moment conditions will be achieved by placing a large weight on 
a few extreme observations, while slightly reducing the weights (relative to 
$n^{-1}$) of the bulk of the observations. Note that this problem is exacerbated 
by the fact that the weights can become extremely large as the singularity 
in the tilting function is approached. This feature will be visible in our 
simulations below. 
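
The concentration of the EL adjustment on extreme observations can be reproduced on a toy sample. The sketch below assumes a scalar moment function and solves the EL first-order condition $\sum_i g_i/(1 - \lambda g_i) = 0$ by bisection on the interval where all weights are positive; the names `el_lambda` and `el_weights` and the data are hypothetical.

```python
def el_lambda(g, iters=200):
    """Scalar EL multiplier: root of sum_i g_i / (1 - lam * g_i), found by bisection."""
    lo, hi = 1.0 / min(g), 1.0 / max(g)    # requires min(g) < 0 < max(g)
    f = lambda lam: sum(gi / (1.0 - lam * gi) for gi in g)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:          # interval exhausted in floating point
            break
        if f(mid) > 0.0:                    # f is increasing in lam on (lo, hi)
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def el_weights(g):
    """EL implied probabilities w_i = 1 / (n * (1 - lam * g_i))."""
    lam = el_lambda(g)
    n = len(g)
    return [1.0 / (n * (1.0 - lam * gi)) for gi in g]

# Toy sample: the bulk of the values is positive, with one extreme negative value.
g = [-4.0, 0.5, 0.8, 1.0, 1.2, 1.5]
w = el_weights(g)
print([round(len(g) * wi, 3) for wi in w])   # weights relative to 1/n
```

In this sample the moment condition can only be met by raising the weight of the single extreme observation above $n^{-1}$ while the remaining observations are scaled down.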

We conjecture that any ECR estimator with $\gamma < 0$ exhibits the same problems 
as EL under misspecification due to the presence of a ratio in the tilting 
function. Thus, if we focus solely on ECR which preclude negative implied 
probabilities ($\gamma \le 0$), we are left with ET (corresponding to $\gamma = 0$) as the 
only candidate ECR whose behavior might not degrade dramatically under 
misspecification. This is precisely the choice made in [32] for the analysis 
of misspecified moment restriction models. The asymptotic variance of ET 
under misspecification is finite under reasonable assumptions, the most 
restrictive of which is slightly stronger than the requirement of the existence 
of the moment generating function $M_g(\lambda) = E[\exp(\lambda' g(x_i,\theta))]$ for $\theta$ and $\lambda$ 
in some bounded sets. 

3. Exponentially tilted empirical likelihood. Higher-order asymptotic 
properties in correctly specified models point to EL, while good behavior under 
misspecification points toward ET. There appear to be significant benefits 
in combining EL and ET into a single estimator exhibiting the 
advantages of both. 

It has been suggested [10, 47, 50] that other GEL estimators that exhibit 
the same higher-order properties as EL can be devised by simply employing 
a tilting function $\tau(\xi)$ which admits the same Taylor expansion as the 
tilting function of EL in the vicinity of $\xi = 0$ up to sufficiently high order. 
The behavior of $\tau(\xi)$ farther away from $\xi = 0$ could then be independently 
set to match the behavior of ET under misspecification. This option is not 
particularly attractive because (i) the estimator completely loses its 
interpretation as a minimum empirical discrepancy estimator, (ii) the estimator 
can no longer be seen as either a maximum likelihood or a maximum entropy 
estimator, concepts that initially motivated the form of the EL and ET 
estimators, and (iii) there still exist an infinite number of ways to interpolate 
between EL and ET in order to construct $\tau(\xi)$, making the procedure highly 
nonunique. For these reasons we focus on a different approach. 

3.1. The estimator. We propose to combine the EL and ET estimators 
in the following fashion. 

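Definition 2 (ETEL estimator). 

(9)  $\hat\theta = \arg\min_{\theta} n^{-1} \sum_i \tilde h(\hat w_i(\theta))$, 

where $\hat w_i(\theta)$ is the solution to 

(10)  $\min_{\{w_i\}_{i=1}^{n}} n^{-1} \sum_i h(w_i)$ 

subject to 

(11)  $\sum_i w_i g(x_i,\theta) = 0$ and $\sum_i w_i = 1$, 

and where 

(12)  $\tilde h(w_i) = -\ln(n w_i)$, 

(13)  $h(w_i) = n w_i \ln(n w_i)$. 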



The discrepancies used in the above optimization problem correspond 
to using ET to find $\hat w_i(\theta)$ and using EL to find $\hat\theta$. Since $h(\cdot)$ belongs to the 
family of ECR discrepancies, this type of estimator still admits an $(N_g + N_\theta)$-dimensional 
dual optimization problem of the form 

(14)  $\hat\theta = \arg\min_{\theta} n^{-1} \sum_i \tilde h(\hat w_i(\theta))$, 

where $\tilde h(w) = -\ln(nw)$ is the EL discrepancy of equation (12) and $\hat w_i(\theta)$ is 
given by equation (7) with 

(15)  $\hat\lambda(\theta) = \arg\max_{\lambda} n^{-1} \sum_i \rho(\lambda' g(x_i,\theta))$ 

and $\rho(\xi) = -\exp(\xi)$. This approach yields a unique estimator that combines 
the likelihood form of EL [equation (9)] while incorporating the concept of 
entropy characterizing ET [equation (10)]. For these reasons, we call this 
estimator exponentially tilted empirical likelihood (ETEL). Other authors 
[10, 31, 40] have considered this combination of EL and ET for the purpose 
of constructing likelihood-ratio confidence regions for the mean. It has also 
been shown that a nonparametric Bayesian procedure based on a prior on the 
space of distributions that favors distributions having a large entropy yields 
a posterior whose maximum would define the ETEL estimator [58]. This 
paper's contribution will be to identify the numerous desirable asymptotic 
properties of ETEL point estimates in the case of general moment functions 
$g(x_i,\theta)$ in the context of overidentified and possibly misspecified models. 
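
The nested structure of (14)-(15) can be sketched as follows for a toy overidentified model with one parameter and two moment functions, $g(x,\theta) = (x_1 - \theta, x_2 - \theta)'$. Everything in this block is an illustrative assumption: the model, the data, the function names, the damped Newton inner solver and the golden-section outer search (which presumes a single interior maximum on the search interval).

```python
import math

def _exp(z):
    # Overflow guard for trial steps only; not part of the estimator itself.
    return math.exp(min(z, 700.0))

def et_lambda(G):
    """Inner problem (15) for N_g = 2: minimize mean(exp(lam'g_i)) by damped Newton."""
    n = len(G)
    obj = lambda l: sum(_exp(l[0] * a + l[1] * b) for a, b in G) / n
    lam = [0.0, 0.0]
    for _ in range(50):
        e = [_exp(lam[0] * a + lam[1] * b) for a, b in G]
        g0 = sum(ei * a for ei, (a, b) in zip(e, G)) / n
        g1 = sum(ei * b for ei, (a, b) in zip(e, G)) / n
        if math.hypot(g0, g1) < 1e-12:
            break
        h00 = sum(ei * a * a for ei, (a, b) in zip(e, G)) / n
        h01 = sum(ei * a * b for ei, (a, b) in zip(e, G)) / n
        h11 = sum(ei * b * b for ei, (a, b) in zip(e, G)) / n
        det = h00 * h11 - h01 * h01
        s0 = -(h11 * g0 - h01 * g1) / det   # Newton direction -H^{-1} grad
        s1 = -(h00 * g1 - h01 * g0) / det
        t = 1.0
        while obj([lam[0] + t * s0, lam[1] + t * s1]) > obj(lam):
            t *= 0.5                         # backtrack on the convex objective
        lam = [lam[0] + t * s0, lam[1] + t * s1]
    return lam

def etel_weights(theta, data):
    """ET implied probabilities (7) at a given theta."""
    G = [(x1 - theta, x2 - theta) for x1, x2 in data]
    lam = et_lambda(G)
    e = [_exp(lam[0] * a + lam[1] * b) for a, b in G]
    total = sum(e)
    return [ei / total for ei in e]

def etel_loglik(theta, data):
    """EL-type objective n^{-1} sum_i ln(n w_i), i.e. minus the criterion in (14)."""
    n = len(data)
    return sum(math.log(n * w) for w in etel_weights(theta, data)) / n

def etel(data, lo, hi, tol=1e-8):
    """Outer problem: golden-section search for the maximizer on [lo, hi]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    f = lambda th: etel_loglik(th, data)
    a, b = lo, hi
    c, d = b - r * (b - a), a + r * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            b, d, fd = d, c, fc
            c = b - r * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + r * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

# Toy data (hypothetical): pairs (x1, x2) scattered around the diagonal.
data = [(-2.0, -3.0), (-3.0, -1.5), (3.0, 2.0), (2.0, 3.5),
        (0.0, 1.5), (1.0, -1.0), (-1.0, 0.5), (0.5, 0.0)]
theta_hat = etel(data, -1.0, 1.0)
```

The inner solver only works when zero lies inside the convex hull of the $g(x_i,\theta)$, which is why the outer search is restricted to an interval on which this holds for the toy data.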

The fact that the ETEL point estimate is the solution to two nested 
optimization problems (one of dimension $N_g$ and one of dimension $N_\theta$) instead 
of a single saddle-point problem does not complicate the implementation of 
the estimator. Indeed, ECR estimators are often implemented as two nested 
optimization problems despite their saddle point form, because it is easier to 
design robust numerical methods for locating either a maximum or a mini- 
mum that do not break down near inflection points of the objective function 



ETEL represents only one of the many possible combinations between two 



using the EL discrepancy to find $\hat\theta$ stands out as a particularly attractive 
choice because the optimization problem defining $\hat\theta$ maintains the maximum 
likelihood form of EL, thus making it more likely that EL's higher-order 
properties will be preserved, an issue that will be investigated below. The 
use of the ET discrepancy to find the weights $\hat w_i(\theta)$ is also natural. Since the 
objective function for $\theta$ contains $\ln(\hat w_i(\theta))$, it is imperative that the weights 
$\hat w_i(\theta)$ be positive by construction and not only asymptotically in correctly 
specified models. As noted earlier, if we focus on weights obtained from the 
ECR family, in order to maintain the low dimensional dual formulation, 
only ECR with $\gamma \le 0$ provide positive weights by construction [4]. However, 
any ECR with $\gamma < 0$ contains a singularity in its influence function, leaving 
$\gamma = 0$, or ET, as the only sensible choice to find the weights in the presence 
of potential model misspecification. 

From a conceptual point of view, one may wonder about the interpretation 
of the ETEL estimator, since its definition combines two different 
discrepancies. It is often pointed out that in the case of discrete distributions, 
EL provides maximum likelihood estimates of both $\theta^*$, the true value 
of the parameter vector of interest, and the weights. Since ET weights are 
used in ETEL, ETEL weights are not maximum likelihood estimates, but in 
itself this is not a great concern since the weights are nuisance parameters 
and inference focuses on $\theta^*$. Indeed, after solving for all the parameters in 
terms of $\theta$, both the ETEL and EL estimators of $\theta^*$ can be cast into the 
familiar form of a maximum likelihood estimator of $\theta^*$ (as opposed to both 
$\theta^*$ and the weights, as in EL), 



[43]. 

where $\hat w_i(\theta)$ is given by equations (7) and (5). Of course, such an estimator 
can only formally be identified as a maximum likelihood estimator in the 
special case of a discrete distribution having support consisting of a finite 
number of points. More generally, for continuous distributions we can 
nevertheless refer to $\hat\theta$ as a MED estimator of $\theta^*$ using the KLIC discrepancy 
(as for maximum likelihood estimators), an interpretation that will be 
relevant under model misspecification. The distinction between EL and ETEL 
lies in how the estimate of the distribution of the data generating process 
$\hat w_i(\theta)$ given $\theta$ is constructed. In a parametric likelihood, $\hat w_i(\theta)$ would be 
uniquely given by the distributional assumptions of the model. When moment 
conditions replace distributional assumptions, however, there exists no 
such unique choice of $\hat w_i(\theta)$, due to the nonparametric nature of the problem. 
Both EL and ETEL replace parametric distributional assumptions by 
a so-called least favorable family of distributions (see, e.g., [15]), that is, 
a parametric family of distributions (indexed by $\theta$) for which the estimation 
problem is as difficult as the original nonparametric problem. In other 
words, for each $\theta$ there exist an infinite number of distributions satisfying 
the moment conditions, and the specific discrete distribution defined by 
$\hat w_i(\theta)$ represents a worst-case scenario among them. As pointed out in [15], 
there exist numerous least favorable families; EL and ETEL merely employ 
different ones and, a priori, there is no reason to favor one over the other. 
In the case of ETEL, the least favorable family chosen is the class of 
distributions obtained by maximizing entropy under the $\theta$-dependent moment 
constraints imposed by the model. Entropy maximization has a long history 
as a device to construct distributions which properly model lack of prior 
information under a set of known constraints (see, e.g., [12, 17, 36, 38, 60]). 
ETEL thus combines the well-established concept of entropy maximization 
to handle the nonparametric part of the estimation problem, while using like- 
lihood maximization to deal with the parametric part of the problem. The 
idea of substituting nonparametric nonmaximum likelihood estimates [here, 
the $\hat w_i(\theta)$] into a likelihood-type objective function to avoid the pathological 
behavior of an approach based solely on maximum likelihood also parallels 
the work of Fan and co-workers [16]. 

One may have preferences regarding which estimator of the distribution is 
the more appealing, but the choice between EL and ETEL should ultimately 
be based on the comparison of the actual asymptotic properties of each 
estimator and their performance in simulation experiments, which is what 
the remainder of this article is devoted to. 

3.2. Properties. 

3.2.1. First-order properties. To simplify the notation, we make the 
dependence of all quantities on $\theta$ implicit and introduce the following 
definitions. 

Definition 3. Let $\hat w_i = \hat w_i(\theta)$, $\hat\lambda = \hat\lambda(\theta)$, $g_i = g(x_i,\theta)$, 
$\hat g = n^{-1}\sum_i g(x_i,\theta)$, $G_i = \partial g(x_i,\theta)/\partial\theta'$, $G = E[G_i]$, $\hat G = n^{-1}\sum_i G_i$, $\tilde G = \sum_i \hat w_i G_i$, 
$\hat\Omega = n^{-1}\sum_i g_i g_i'$, $\Omega = E[g_i g_i']$ and $\tilde\Omega = \sum_i \hat w_i g_i g_i'$. Quantities evaluated at 
$\theta = \theta^*$ are denoted by $*$. 

Simple algebraic manipulations yield the following. 

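Theorem 2. The ETEL estimator $\hat\theta_{\rm ETEL}$ maximizes the objective
function

(17) $\ln \hat L(\theta) = -\ln\Bigl(n^{-1}\sum_i \exp\bigl(\hat\lambda'(g_i - \hat g)\bigr)\Bigr)$,

where $\hat\lambda$ is such that

(18) $n^{-1}\sum_i \exp(\hat\lambda' g_i)\, g_i = 0$.

The first-order conditions for $\hat\theta_{\rm ETEL}$ can be written as

(19) $n^{-1}\sum_i (1 - n\hat w_i)\, \dfrac{d(\hat\lambda' g_i)}{d\theta'} = 0$,

where the total derivative indicates that $\hat\lambda$ is allowed to vary with $\theta$.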
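
The definition in Theorem 2 amounts to a two-level computation: an inner convex problem that determines the multiplier in (18), followed by the evaluation of (17). The Python sketch below illustrates this for a fixed $\theta$. It is an illustration only, not the implementation behind the results reported in this paper: the function names (`et_lambda`, `et_weights`, `etel_log_objective`), the damped Newton solver and its tolerances are assumptions made for the example, and the input `g` is assumed to be the $n \times N_g$ array whose rows are $g(x_i,\theta)$.

```python
import numpy as np


def et_lambda(g, tol=1e-10, max_iter=100):
    """Solve (18), n^{-1} sum_i exp(lam'g_i) g_i = 0, for the tilting parameter.

    (18) is the gradient of the convex function lam -> mean(exp(g @ lam)),
    so a damped Newton iteration started at lam = 0 is used here
    (a choice made for this sketch, not prescribed by the text).
    """
    n, m = g.shape
    lam = np.zeros(m)
    for _ in range(max_iter):
        e = np.exp(g @ lam)
        grad = g.T @ e / n                     # left-hand side of (18)
        if np.linalg.norm(grad) < tol:
            break
        hess = (g * e[:, None]).T @ g / n      # Hessian of mean(exp(g @ lam))
        step = np.linalg.solve(hess, grad)
        f0, t = e.mean(), 1.0
        # Halve the step while the convex objective increases noticeably.
        while np.exp(g @ (lam - t * step)).mean() > f0 * (1.0 + 1e-12) and t > 1e-8:
            t *= 0.5
        lam = lam - t * step
    return lam


def et_weights(g, lam):
    """Implied probabilities w_i proportional to exp(lam'g_i)."""
    e = np.exp(g @ lam)
    return e / e.sum()


def etel_log_objective(g):
    """Return (ln L, lam), with ln L = -ln(mean(exp(lam'(g_i - gbar)))) as in (17)."""
    lam = et_lambda(g)
    centered = g - g.mean(axis=0)
    return -np.log(np.mean(np.exp(centered @ lam))), lam
```

Maximizing the first returned value over $\theta$ with any generic optimizer would then give a point estimate in the spirit of Theorem 2. The inner problem has a solution only when the origin lies inside the convex hull of the rows of `g`.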

We then establish that ETEL is at least as good as any ECR estimator 
both in terms of its first-order asymptotic properties and in terms of its 
invariance properties. 

Assumption 1 (Regularity conditions). 

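1. $x_i$ forms an i.i.d. sequence.

2. $\theta^* \in \mathrm{int}(\Theta)$ is the unique solution to $E[g(x_i,\theta)] = 0$, where $\Theta$ is compact.

3. $g(x_i,\theta)$ is continuous (in $\theta$) at each $\theta \in \Theta$ with probability one.

4. $E[\sup_{\theta\in\Theta} \|g_i\|^{2+\delta}] < \infty$ for some $\delta > 0$ and $E[\sup_{\theta\in\mathcal{N}} \|G_i\|] < \infty$.

5. $\Omega^*$ is nonsingular and finite and $\mathrm{rank}(G^*) = N_\theta$.

6. $g(x_i,\theta)$ is continuously differentiable (in $\theta$) in a neighborhood $\mathcal{N}$ of $\theta^*$.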

These assumptions match those of Theorem 3.2 in [47] and include those 
of Theorem 3.4 in [50]. 

Theorem 3 (First-order properties). Under Assumption 1, the ETEL
estimator (i) has the same limiting distribution as efficient two-step GMM,

$n^{1/2}(\hat\theta - \theta^*) \xrightarrow{d} N(0, \Sigma)$,

where $\Sigma = (G^{*\prime}(\Omega^*)^{-1}G^*)^{-1}$, (ii) ETEL enables the construction of
$\chi^2$-calibrated likelihood-ratio confidence regions for $\theta$,

$-2n \ln(\hat L(\theta^*)/\hat L(\hat\theta)) \xrightarrow{d} \chi^2_{N_\theta}$,

and (iii) of a $\chi^2$-calibrated test of the validity of overidentifying restrictions.

Theorem 4 (Implied probabilities and invariance properties). Whenever
the ETEL estimator is defined, (i) it yields positive implied probabilities
[$\hat w_i(\theta) > 0$], (ii) it is invariant under arbitrary one-to-one
differentiable reparametrization $\theta = T(\beta)$ of the moment conditions [the
estimate $\hat\beta$ obtained from the reparametrized moment conditions satisfies
$\hat\theta = T(\hat\beta)$] and (iii) it is invariant under general parameter-dependent
nonsingular linear transformation $A(\theta)$ of the vector of moment conditions
(using $E[A(\theta)g(x_i,\theta)] = 0$ or $E[g(x_i,\theta)] = 0$ as moment conditions gives
the same $\hat\theta$).

3.2.2. Higher-order asymptotic properties. Estimators having the same 
(first-order) asymptotic variance can be compared on the basis of their 
higher-order ($o_p(n^{-1/2})$) asymptotic properties [55]. While it has been
established that likelihood-ratio confidence regions of the mean constructed using
ETEL do not share EL's Bartlett correctability [10, 31], another type of an- 
alytic higher-order correction permits the same improvement in the order 
of the coverage accuracy [40]. Moreover, it has been observed in simulation 
studies [50, 62] that the Bartlett correction is often ineffective in practice be- 
cause the "QQ" plots for the EL likelihood ratio test statistics are typically 
curved, making it unlikely that a linear correction such as Bartlett's would 
improve coverage accuracy. Finally, given that ETEL's objective function 
can be interpreted as a posterior for the parameter $\theta$ obtained via a
nonparametric Bayesian procedure [58], it may be a more relevant and
interesting topic of future research to verify whether a Bayesian Bartlett correction
[5], which differs from the usual frequentist Bartlett correction, would be 
applicable to ETEL. 

More importantly, we can show that the ETEL point estimate $\hat\theta_{\rm ETEL}$
shares all of the other higher-order properties of EL established in [47].
Higher-order asymptotic properties of an estimator $\hat\theta$ are defined through a
stochastic expansion (see, e.g., [52, 55]) of the form

(20) $(\hat\theta - \theta^*) = n^{-1/2}\psi + n^{-1}q + n^{-3/2}f + O_p(n^{-2})$,

where $\psi$, $q$ and $f$ are $O_p(1)$ and where $\psi$ and $f$ have zero mean. Within
this framework, the $O(n^{-1})$ bias is defined as $E[q]$ and represents the most
important correction to standard first-order asymptotics based solely on the
influence function $\psi$. Another important correction to first-order
asymptotics is the $O(n^{-2})$ variance, defined as

(21) $\mathrm{Var}[q] + \mathrm{Covar}[f, \psi] + \mathrm{Covar}[\psi, f]$.

This expression can be informally obtained by computing the variance of
equation (20). In general, it is not meaningful to compare the $O(n^{-2})$
variances of two estimators that possess different $O(n^{-1})$ biases and bias-corrected
estimators should be used to compare efficiency.
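
To make the origin of (21) explicit, one can take the variance of the first three terms of (20) and collect powers of $n$. The display below is an informal sketch that assumes the relevant moments exist and ignores the remainder of (20). It writes $\psi$ for the influence function and $f$ for the $n^{-3/2}$ term, as in (20).

```latex
\begin{align*}
\operatorname{Var}\bigl[n^{-1/2}\psi + n^{-1}q + n^{-3/2}f\bigr]
  &= n^{-1}\operatorname{Var}[\psi]
   + n^{-3/2}\bigl(\operatorname{Covar}[\psi,q] + \operatorname{Covar}[q,\psi]\bigr) \\
  &\quad + n^{-2}\bigl(\operatorname{Var}[q] + \operatorname{Covar}[f,\psi]
   + \operatorname{Covar}[\psi,f]\bigr) + O(n^{-5/2}).
\end{align*}
```

The coefficient of $n^{-2}$ is expression (21), while the $n^{-1}$ term is the usual first-order asymptotic variance.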

We now proceed to compare the stochastic expansions of $\hat\theta_{\rm ETEL}$ and $\hat\theta_{\rm EL}$,
using assumptions found in [47]. Our approach consists of establishing that
the difference $\hat\theta_{\rm ETEL} - \hat\theta_{\rm EL}$ is such that the Newey and Smith results for $\hat\theta_{\rm EL}$
carry over to $\hat\theta_{\rm ETEL}$.

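Theorem 5 (Higher-order equivalence to EL). Under Assumption 1
and if $E[\sup_{\theta\in\mathcal{N}} \|g_i\|^4] < \infty$, $E[\sup_{\theta\in\mathcal{N}} \|G_i\|^2] < \infty$ and, for $\theta \in \mathcal{N}$, $G(x_i,\theta)$
is Lipschitz in $\theta$ with prefactor $b(x_i)$ such that $E[b(x_i)] < \infty$, then
$\hat\theta_{\rm ETEL} - \hat\theta_{\rm EL} = O_p(n^{-3/2})$.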

A consequence of this result is that the ETEL estimator has the same
$O(n^{-1})$ bias as the EL estimator obtained in [47], under their assumptions.
(As shown in [59], this result in fact extends to all estimators constructed 
by substituting GEL weights into the EL objective function.) 

Assumption 2. There exists a function $b(x_i)$ with $E[(b(x_i))^6] < \infty$ such
that, in a neighborhood $\mathcal{N}$ of $\theta^*$, all partial derivatives of $g(x_i,\theta)$ with respect
to $\theta$ up to order four exist, are bounded by $b(x_i)$ and are Lipschitz in $\theta$ with
prefactor $b(x_i)$.

Theorem 6 (Small bias property). Under Assumptions 1 and 2, ETEL's
$O(n^{-1})$ bias is

$n^{-1} H(-a + E[G_i H g_i])$,

where $H = \Sigma G'\Omega^{-1}$ and $a$ is a vector whose elements are
$a_j = \mathrm{tr}(\Sigma E[\partial^2 g_j(x_i,\theta^*)/\partial\theta\,\partial\theta'])/2$, where $g_j(x_i,\theta^*)$ is the $j$th element of $g(x_i,\theta^*)$.

Simple intuition for the small bias of EL is that the EL first-order
condition resembles the first-order condition of GMM ($\hat G'\hat\Omega^{-1}\hat g = 0$) except for the
fact that the Hessian term $\hat\Omega$ and the Jacobian term $\hat G$ are replaced by
efficient averages that are weighted by the EL implied probabilities [47]. This
reweighting removes the $O(n^{-1})$ correlation between the different sample
averages entering the first-order condition, thus reducing the bias. As shown
in the Appendix, ETEL also efficiently weights the Hessian and the Jacobian
terms, using only the ET weights. Since ET and EL implied probabilities
are equivalent to a sufficiently high order, using the ET instead of the EL
weights only contributes to a negligible $O_p(n^{-3/2})$ remainder.

The fact that ETEL and EL are equivalent up to $O_p(n^{-1})$ leads to two
important simplifications in the comparison of their $O(n^{-2})$ variances. First,
since their $O(n^{-1})$ biases are the same, the moments entering the expression
for the bias correction of EL and ETEL are the same. If these moments were
estimated in the same way for EL and for ETEL, then comparing the $O(n^{-2})$
variance of EL and ETEL with or without performing a bias correction
would obviously give the same answer. This conclusion remains unchanged
if the bias correction is applied using the EL estimate of $\theta$ for the EL bias
correction and the ETEL estimate of $\theta$ for the ETEL bias correction since
these estimates differ by $O_p(n^{-3/2})$, which would give rise to a difference of
only $O_p(n^{-1}n^{-3/2})$ in the bias correction. Moreover, as pointed out in [47],
whether the moments entering the bias correction are estimated by sample
averages or averages weighted by implied probabilities has no effect on the
higher-order variance of the resulting bias-corrected estimator. Hence, using
EL weights for the EL bias correction and ETEL weights for the ETEL bias
correction makes no difference either. In conclusion, we can meaningfully
compare the $O(n^{-2})$ variances of EL and ETEL before performing a bias
correction.

The second simplification made possible by the equivalence of the $O_p(n^{-1})$
terms of the EL and ETEL stochastic expansions is that the differences in
their $O(n^{-2})$ variance must take the form

(22) $\mathrm{Covar}[f_{\rm ETEL} - f_{\rm EL}, \psi] + \mathrm{Covar}[\psi, f_{\rm ETEL} - f_{\rm EL}]$,

as seen in (21). Hence, it is possible for ETEL and EL to differ by $O_p(n^{-3/2})$,
while still sharing the same $O(n^{-2})$ variance, as long as that difference is
uncorrelated with their (identical) influence function $\psi$. In fact, this is precisely
what is shown in the Appendix.

Theorem 7 (Higher-order efficiency). Under Assumptions 1 and 2, the
$O(n^{-2})$ variances of ETEL and EL are equal.

Maintaining the maximum likelihood form for the optimization problem
defining $\hat\theta_{\rm ETEL}$ thus achieves the desired goal, namely, to maintain the
higher-order asymptotic properties of EL found in [47]. It is the fact that
(22) vanishes that enables ETEL to be higher-order efficient even though it 
differs sufficiently from EL to fail to be Bartlett correctable in the frequentist 
sense.

3.2.3. Behavior under misspecification. While in the previous section we
have seen that ETEL inherits the higher-order properties of EL, we will now 
show that it also exhibits some of the desirable properties of ET that EL 
lacks under model misspecification. 

Following the discussion of Section 3.1, ETEL's pseudo-true value $\theta^*$
minimizes the KLIC discrepancy between the true data generating process and
an entropy maximizing least favorable family of distributions parametrized
by $\theta$ (which replaces the distributional assumptions in parametric maximum
likelihood). 

We will now study the first-order asymptotic properties of ETEL under 
misspecification. 

Theorem 8. For a given $\theta$, assume that $E[\exp(\lambda' g(x_i,\theta))]$ exists in a
neighborhood of its minimum. If a subvector of $g(x_i,\theta)$ is statistically
independent of the remaining elements of $g(x_i,\theta)$, then the empirical c.d.f.
obtained from ETEL (or ET) implied probabilities at $\theta$ converges pointwise
(at every point of continuity of the true c.d.f.) to a c.d.f. that maintains 
this independence, even under misspecification. EL achieves this only in the 
absence of misspecification. 

This indicates the possibility that using an empirical c.d.f. obtained from 
the implied probability weights of EL in the hope of improving accuracy 
could actually result in the introduction of a spurious dependence among 
variables. ETEL avoids this unappealing eventuality. This property could be 
helpful when the implied probabilities are employed to improve the efficiency 
of the bootstrap, as in [7], when the model happens to be misspecified. 

A more important quality that ETEL shares with ET is the nonsingular 
behavior of its influence function. As noted by [30], an estimator's influence 
function $\psi(x_i)$ is proportional to its first-order conditions. By inspection
of ETEL's first-order condition [equation (19)] it is clear ETEL's influence 
function will not contain any singularity, unlike EL's influence function. It 
will therefore not be surprising that ETEL avoids EL's undesirable behavior 
under misspecification, under regularity conditions similar to the ones made 
by [32] for ET, as shown more formally below. 

Let $\lambda^*(\theta)$ denote the solution to $E[\exp(\lambda' g(x_i,\theta))g(x_i,\theta)] = 0$, which is
unique by the strict convexity of $E[\exp(\lambda' g(x_i,\theta))]$ in $\lambda$.

Assumption 3 (Regularity conditions under misspecification). 

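1. $x_i$ forms an i.i.d. sequence.

2. The function $\ln L(\theta) = -\ln(E[\exp(\lambda^{*\prime}(\theta)(g(x_i,\theta) - E[g(x_i,\theta)]))])$ is
maximized at a unique "pseudo-true" value $\theta^* \in \mathrm{int}(\Theta)$, where $\Theta$ is compact.

3. $g(x_i,\theta)$ is continuous (in $\theta$) at each $\theta \in \Theta$ with probability one.

4. $E[\sup_{\theta\in\Theta}\sup_{\lambda\in\Lambda(\theta)} \exp(\lambda' g(x_i,\theta))] < \infty$ where $\Lambda(\theta)$ is a compact set such
that $\lambda^*(\theta) \in \mathrm{int}(\Lambda(\theta))$.

5. $S_{jl}(x_i,\theta) = \partial^2 g(x_i,\theta)/\partial\theta_j\,\partial\theta_l$ is continuous (in $\theta$) for $\theta \in \mathcal{N}$, a
neighborhood of $\theta^*$.

6. There exists $b(x_i)$ satisfying $E[\sup_{\theta\in\mathcal{N}}\sup_{\lambda\in\Lambda(\theta)} \exp(k_1\lambda' g(x_i,\theta))(b(x_i))^{k_2}] < \infty$
for $k_1 = 1,2$ and $k_2 = 0,1,2,3,4$ such that $\|G(x_i,\theta)\| < b(x_i)$
and $\|S_{jl}(x_i,\theta)\| < b(x_i)$ for $j,l = 1,\ldots,N_\theta$ for any $x_i \in \mathcal{X}$ and for all
$\theta \in \mathcal{N}$.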

The simplest way to describe the asymptotics of ETEL under
misspecification is to introduce an equivalent just-identified GMM estimator
involving an augmented parameter vector $\beta = (\tau, \kappa', \lambda', \theta')'$. The vector $\theta \in \mathbb{R}^{N_\theta}$
is the parameter vector of interest, while $(\tau,\kappa',\lambda')' \in \mathbb{R}^{1+2N_g}$ are auxiliary
parameters to be estimated jointly with $\theta$. The dimension of this augmented
parameter vector is higher than in the case of GEL estimators under
misspecification ($1 + 2N_g + N_\theta$ instead of $N_g + N_\theta$). This is due to the fact
that the first-order conditions for $\theta$ in ETEL involve a few additional terms
taking the form of a product of sample moments that are absent in GEL
estimators. Each of these products of sample moments can be linearized by
introducing the additional parameters $\kappa$ and $\tau$. Note that these additional
parameters are merely a device used to simplify the construction of the
covariance matrix of the estimator. The point estimate can be obtained
without introducing $\kappa$ and $\tau$, as seen in Theorem 2.

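Lemma 9. The ETEL point estimate $\hat\theta$ is given by the appropriate
subvector of the vector $\hat\beta = (\hat\tau, \hat\kappa', \hat\lambda', \hat\theta')'$, the solution to

(23) $n^{-1}\sum_i \phi(x_i, \hat\beta) = 0$,

where, letting $\tau_i = \exp(\lambda' g_i)$,

$\phi(x_i,\beta) =
\begin{pmatrix}
\tau_i - \tau \\
\tau_i g_i \\
(\tau - \tau_i) g_i + \tau_i g_i g_i' \kappa \\
\tau_i G_i'\kappa + \tau_i G_i'\lambda g_i'\kappa - \tau_i G_i'\lambda + \tau G_i'\lambda
\end{pmatrix}
=
\begin{pmatrix}
\tau_i - \tau \\
\frac{\partial}{\partial\kappa}(\tau_i g_i'\kappa + \tau g_i'\lambda - \tau_i) \\
\frac{\partial}{\partial\lambda}(\tau_i g_i'\kappa + \tau g_i'\lambda - \tau_i) \\
\frac{\partial}{\partial\theta}(\tau_i g_i'\kappa + \tau g_i'\lambda - \tau_i)
\end{pmatrix}.$

Table 2
The bias of the EL, ETEL and ET estimators for the Hall-Horowitz design

            EL       ETEL     ET
K = 4       0.063    0.061    0.103
K = 10      0.129    0.103    0.232
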

Given the just-identified nature of the estimator defined in Lemma 9, its
asymptotic distribution follows quite directly.

Theorem 10 (Asymptotics under misspecification). Let
$\Gamma = E[\partial\phi(x_i,\beta)/\partial\beta'|_{\beta=\beta^*}]$ and $\Phi = E[\phi(x_i,\beta^*)\phi'(x_i,\beta^*)]$. Under Assumption 3, if $\Gamma$ is
nonsingular, then $n^{1/2}(\hat\beta - \beta^*) \xrightarrow{d} N(0, \Gamma^{-1}\Phi(\Gamma')^{-1})$.
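
The covariance matrix in Theorem 10 has the usual sandwich form for a just-identified GMM estimator, so it can be approximated by replacing the two expectations with sample averages. The Python sketch below does this for a generic moment function. It is an illustration under assumptions of its own: the name `sandwich_variance`, the signature `phi(x, beta)` returning an $n \times p$ array, and the use of central finite differences for the Jacobian are not taken from the text.

```python
import numpy as np


def sandwich_variance(phi, beta_hat, x, eps=1e-6):
    """Sample analogue of Gamma^{-1} Phi (Gamma')^{-1} for a just-identified GMM.

    phi(x, beta) must return an (n, p) array with p = len(beta_hat); row i
    holds the moment function evaluated at observation i.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    p = beta_hat.size
    vals = phi(x, beta_hat)
    Phi = vals.T @ vals / vals.shape[0]        # sample analogue of E[phi phi']
    Gamma = np.empty((p, p))
    for j in range(p):
        # Column j: central finite difference of mean(phi) with respect to beta_j.
        step = np.zeros(p)
        step[j] = eps
        Gamma[:, j] = (phi(x, beta_hat + step).mean(axis=0)
                       - phi(x, beta_hat - step).mean(axis=0)) / (2.0 * eps)
    Gamma_inv = np.linalg.inv(Gamma)
    return Gamma_inv @ Phi @ Gamma_inv.T
```

Dividing the result by $n$ gives an approximate covariance matrix for $\hat\beta$. For the ETEL application, `phi` would be the function $\phi(x_i,\beta)$ of Lemma 9 and the block corresponding to $\theta$ would be read off the result.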

4. Simulations. We first illustrate the fact that ETEL has the same
$O(n^{-1})$ bias as EL. We use the simple experimental design suggested in
[19] and subsequently used in [30, 33], slightly expanded to have $K$ moment
conditions rather than two. The moment conditions are

$g(x_i,\theta) = [\,r(x_i,\theta)\;\; r(x_i,\theta)x_{i2}\;\; r(x_i,\theta)(x_{i3} - 1)\;\; \cdots\;\; r(x_i,\theta)(x_{iK} - 1)\,]'$,

where $r(x_i,\theta) = \exp(-0.72 - (x_{i1} + x_{i2})\theta + 3x_{i2}) - 1$. These restrictions are
satisfied at $\theta^* = 3$ when $(x_{i1},x_{i2})' \sim N(0, (0.16)I)$ and $x_{ik} \sim \chi^2_1$ for
$k = 3,\ldots,K$. Note that the third moments of all elements of $g(x_i,\theta)$ are nonzero
and that $g(x_i,\theta)$ is nonlinear in $\theta$, so that the $O(n^{-1})$ bias does not trivially
vanish. Figure 1 shows the c.d.f. of the EL, ET and ETEL estimators of $\theta$
obtained from 10,000 replicated samples of the above design (with $K = 4$ and
$K = 10$), each containing 200 observations. [Samples for which at least one
of the three estimators considered failed to converge were discarded. This
happened 14 times for $K = 4$ and 32 times for $K = 10$. The most frequent
reason for failure of convergence was that the origin was not contained within
the convex hull of the values of $g(x_i,\theta)$ for any $\theta$, in which case none of the
estimators is even defined. The number of nondiscarded samples is 10,000.]
It is apparent that the ETEL and EL point estimates have very similar
distributions, as expected from their equivalence up to the $O_p(n^{-1})$ term
of their stochastic expansion. The distribution of the ET point estimates
differs noticeably from that of EL and ETEL, and the main difference takes
the form of a bias, which is reported in Table 2. The bias of ET increases
more rapidly with the number of moment conditions than the biases of both
EL and ETEL, as the higher-order asymptotic analysis given in [47] and in
the present work would suggest.

Our next simulation compares the behavior of EL, ET and ETEL under
misspecification. We consider a simple case where we wish to estimate
the mean while imposing a known variance. In this example, the moment
conditions are

$g(x_i,\theta) = [\,x_i - \theta\;\; (x_i - \theta)^2 - 1\,]'$,

where $x_i$ is drawn either from a correctly specified Model C or a misspecified
Model M,

$x_i \sim N(0,1)$ (for Model C),

$x_i \sim N(0, (0.8)^2)$ (for Model M).

Note that this experiment is specifically designed so that the pseudo-true
value ($\theta^* = 0$) for the misspecified model is the same for EL, ET and ETEL,
thus enabling a meaningful comparison of the variances of these estimators.
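
The two simulation designs of this section can be sketched in a few lines of Python. The block below only generates data and evaluates the moment functions. It does not reproduce the estimation itself, and all function names and the column layout of `x` are assumptions made for the illustration. The covariance $(0.16)I$ is read as a variance of 0.16 per component, that is, a standard deviation of 0.4.

```python
import numpy as np


def draw_hall_horowitz(rng, n, K):
    """Draw n observations: columns 0-1 are N(0, 0.16), columns 2..K-1 are chi-square(1)."""
    x12 = rng.normal(0.0, 0.4, size=(n, 2))
    xk = rng.chisquare(1, size=(n, K - 2))
    return np.hstack([x12, xk])


def hall_horowitz_moments(x, theta):
    """Return the (n, K) array of moment functions of the expanded Hall-Horowitz design."""
    r = np.exp(-0.72 - (x[:, 0] + x[:, 1]) * theta + 3.0 * x[:, 1]) - 1.0
    instruments = np.column_stack([np.ones(len(x)), x[:, 1], x[:, 2:] - 1.0])
    return r[:, None] * instruments


def draw_model(rng, n, model):
    """Draw from Model C (standard deviation 1) or Model M (standard deviation 0.8)."""
    sd = {"C": 1.0, "M": 0.8}[model]
    return rng.normal(0.0, sd, size=n)


def mean_variance_moments(x, theta):
    """Moment functions for estimating a mean while imposing unit variance."""
    return np.column_stack([x - theta, (x - theta) ** 2 - 1.0])
```

In large samples the average of `hall_horowitz_moments(x, 3.0)` should be close to zero in every column. Under Model M the second column of `mean_variance_moments(x, 0.0)` should average about $0.64 - 1 = -0.36$, which is the misspecification being studied.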

Figure 2 shows the c.d.f. of the EL, ET and ETEL estimators of $\theta$ for a
sample size of 1000 and a sample size of 5000, evaluated with 10,000 and
2000 replications, respectively. The variability of the EL estimate is clearly
larger than that of ET and ETEL, as confirmed by the calculated standard
deviations given in Table 3. Interestingly, the distributions (and the standard
deviations) of the ET and ETEL estimators are quite similar. While the ET
and ETEL standard deviations shrink by the expected factor of $\sqrt{5}$ as the
sample size is increased from 1000 to 5000, the standard deviation of EL
barely changes, which is not surprising given the results of Theorem 1. Note
that the difference between the distribution of EL and that of the two other 
estimators can be made arbitrarily large either by increasing the amount of 
misspecification or by increasing the sample size. 
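
The comparison with the root n rate can be checked by simple arithmetic on the entries of Table 3. The short Python sketch below computes, for each model and estimator, the ratio of the standard deviation at n = 1000 to that at n = 5000 and sets it against $\sqrt{5000/1000} = \sqrt{5}$. The numbers are copied from Table 3; the names `TABLE3`, `shrink_ratios` and `ROOT_N_BENCHMARK` are introduced here for illustration only.

```python
import math

# Standard deviations copied from Table 3, as (n = 1000, n = 5000) pairs.
TABLE3 = {
    "C": {"EL": (0.032, 0.014), "ETEL": (0.032, 0.014), "ET": (0.032, 0.014)},
    "M": {"EL": (0.054, 0.052), "ETEL": (0.038, 0.019), "ET": (0.031, 0.014)},
}

# Factor by which a root-n consistent estimator's standard deviation should shrink.
ROOT_N_BENCHMARK = math.sqrt(5000 / 1000)


def shrink_ratios(table):
    """Ratio sd(n = 1000) / sd(n = 5000) for every model and estimator."""
    return {
        model: {est: sd_1000 / sd_5000 for est, (sd_1000, sd_5000) in row.items()}
        for model, row in table.items()
    }


if __name__ == "__main__":
    for model, row in shrink_ratios(TABLE3).items():
        for est, ratio in row.items():
            print(f"Model {model}, {est}: ratio {ratio:.2f} "
                  f"(root-n benchmark {ROOT_N_BENCHMARK:.2f})")
```

Because the table entries are rounded, the ratios match the benchmark only approximately. The EL ratio under Model M stays close to one, which is the pattern described in the text.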




Fig. 1. Cumulative distribution function of the EL, ET and ETEL estimators for the
Hall-Horowitz design with 4 (left) and 10 (right) moment conditions. The sample size is
n = 200 and 10000 replications were used to calculate this empirical c.d.f.

Fig. 2. Cumulative distribution function of the EL, ET and ETEL estimators for Models
C and M defined in the text. For the top portion of the figure, the sample size is n = 1000
and 10000 replications were used. For the bottom portion of the figure, n = 5000 and 2000
replications were used.

We can also use simulations to illustrate the source of EL's poor behavior
under misspecification. Figure 3 shows the implied probabilities for EL and
ETEL in two simulated samples of size n = 1000 and n = 5000 drawn from
the misspecified Model M. It is apparent that the EL implied probabilities
attribute an excessive weight to the extreme observations. As the sample
size grows, this trend worsens: the second graph exhibits an extremely large
weight at $x_i \approx -3$ and $n\hat w_i \approx 95$. In contrast, the ETEL implied probabilities
distribute the weight more uniformly over the whole sample and, even more
importantly, the weights do not become increasingly concentrated in the
tails as the sample size grows.

Fig. 3. EL and ETEL implied probabilities in simulated samples drawn from the
misspecified Model M as a function of sample size. Note the differences in the scale of the
vertical axes. [Four panels: EL, n = 1000; EL, n = 5000; ETEL, n = 1000; ETEL, n = 5000;
horizontal axis $x_i$ from -3 to 3.]

These examples, although simple and perhaps not realistic, illustrate how 
ETEL matches the low-bias property of the EL estimator and shares the 
reasonable behavior of ET under misspecification. 

5. Conclusion. Our first important result is to show that although em- 
pirical likelihood (EL) is known to exhibit numerous desirable higher-order 
asymptotic properties in correctly specified models, its first-order
asymptotic properties can degrade catastrophically in the presence of the
slightest amount of misspecification, causing the loss of root n consistency.
Although the use of only bounded functions $g(x_i,\theta)$ in the moment conditions
$E[g(x_i,\theta)] = 0$ avoids this problem, this is a rather strong constraint. In
contrast, exponential tilting (ET) is known to be inferior to EL in terms 
of its higher-order properties, but remains well behaved in the presence of 
misspecification under relatively weak regularity conditions [32]. 

Our second main contribution is to show that EL and ET can be combined 
to yield an estimator that exhibits the advantages of both. This so-called 
exponentially tilted empirical likelihood (ETEL) has the same low $O(n^{-1})$
bias and the same $O(n^{-2})$ variance as EL in correctly specified models, and
yet avoids EL's pitfalls in misspecified models. 

APPENDIX: PROOFS 

The quantities given in Definitions 1 and 3 will be used throughout the 
Appendix. Let C denote a generic constant which may take distinct values 
in different contexts. Let CSI stand for Cauchy-Schwarz inequality and let 
w.p.a. 1 stand for the phrase "with probability approaching one." 

Table 3
The standard deviations of the EL, ETEL and ET estimators for Models C and M
defined in the text. The number of replications is 10000 for the n = 1000 sample and
2000 for the n = 5000 sample

                     n = 1000                    n = 5000
Estimator     EL       ETEL     ET        EL       ETEL     ET
Model C       0.032    0.032    0.032     0.014    0.014    0.014
Model M       0.054    0.038    0.031     0.052    0.019    0.014

Proof of Theorem 1. The proof proceeds by constructing a triangular
array of estimators $\hat\theta_{k,n}$ indexed by the sample size $n$ and by an auxiliary
truncation parameter $k$. To define this array, let $\mathcal{G}_k$ be an increasing
sequence of nested compact subsets of $\mathbb{R}^{N_g}$ such that $\bigcup_{k=1}^{\infty}\mathcal{G}_k = \mathbb{R}^{N_g}$. Then
let $C_k = \{x \in \mathcal{X} : g(x,\theta) \in \mathcal{G}_k$ for all $\theta \in \Theta\}$. Note that $C_k$ is nonempty for $k$
sufficiently large.

Let $F_\infty(x)$ denote the distribution of $x$, let $\hat\theta_{\infty,n}$ denote the EL
estimator obtained from a sample of size $n$ and let $\theta^*_\infty$ denote EL's pseudo-true
value, assuming it exists (for otherwise, $\hat\theta_{\infty,n}$ could not even be consistent).

Let $F_k(x)$ be a sequence of distributions indexed by $k \in \mathbb{N}$, each having
support $C_k$. We choose $F_k(x)$ so that, for all sufficiently large $k$, the
moment conditions are uniformly misspecified ($\inf_{k\ge\bar k}\inf_{\theta\in\Theta}\|E_{F_k}[g(x,\theta)]\| > 0$
for some $\bar k \in \mathbb{N}$). Let $\hat\theta_{k,n}$ denote the EL estimator in a sample size of $n$ when
the true data generating process is $F_k(x)$ and let $\theta^*_k$ denote the
corresponding pseudo-true value. We then note that it is also always possible to
choose a distribution $F_k(x)$ with support $C_k$ such that
$P[|u'(\hat\theta_{k,n} - \theta^*_k)| > \varepsilon] \le P[|u'(\hat\theta_{\infty,n} - \theta^*_\infty)| > \varepsilon]$ for any $\varepsilon > 0$, any conformable unit vector $u$ and
all $n$. For instance, one could first construct a distribution $\tilde F_k(x)$ equal to
$F_\infty(x)$ conditional on the event $x \in C_k$. Let $\tilde\theta^*_k$ denote the pseudo-true value
associated with $\tilde F_k(x)$. Then set $F_k(x)$ to be a mixture of $\tilde F_k(x)$ and a
degenerate distribution that would give $\tilde\theta^*_k$ as an EL estimate with certainty.
In this fashion, $F_k(x)$ is a "truncated" version of $F_\infty(x)$ designed to make
the estimation of $\theta^*_k$ by $\hat\theta_{k,n}$ easier than the estimation of $\theta^*_\infty$ by $\hat\theta_{\infty,n}$.
Obviously, $\hat\theta_{k,n}$ is an infeasible estimator that uses out of sample information.
It is introduced solely for the purpose of facilitating the proof. Note that
$\theta^*_k \ne \theta^*_\infty$ in general, but the proof will never require that $\theta^*_k = \theta^*_\infty$.

For a distribution $F_k(x)$ having compact support, the EL estimator can
be written as a just identified GMM estimator of an augmented parameter
vector $\hat\beta = (\hat\theta'_{k,n}, \hat\lambda'_{k,n})'$ satisfying the first-order conditions

(24) $n^{-1}\sum_i G'(x_i,\hat\theta_{k,n})\hat\lambda_{k,n}/(1 - \hat\lambda'_{k,n}g(x_i,\hat\theta_{k,n})) = 0$,

(25) $n^{-1}\sum_i g(x_i,\hat\theta_{k,n})/(1 - \hat\lambda'_{k,n}g(x_i,\hat\theta_{k,n})) = 0$.

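Note that these first-order conditions form a just-identified system of
equations, whether the model is correctly specified or not. Hence, in this
formulation the standard asymptotic theory of just-identified GMM estimators
applies [46] (see also [32] for the application of this idea to ET under
misspecification). The asymptotic variance of a just-identified GMM of the form
$n^{-1}\sum_i \phi(x_i,\hat\beta) = 0$ is given by

(26) $(E[\partial\phi(x_i,\beta)/\partial\beta'])^{-1}\,(E[\phi(x_i,\beta)\phi'(x_i,\beta)])\,(E[\partial\phi'(x_i,\beta)/\partial\beta])^{-1}$.

For $k$ sufficiently large, we can always choose $F_k(x)$ so as to satisfy the
necessary regularity conditions for this expression to hold. In particular,
the compact support of $F_k(x)$ enables $E[g(x_i,\theta)/(1 - \lambda'g(x_i,\theta))]$ to exist
for $(\theta',\lambda')'$ in some neighborhood of the pseudo-true value $(\theta^{*\prime}_k,\lambda^{*\prime}_k)'$. The
asymptotic distribution of $(\hat\theta'_{k,n},\hat\lambda'_{k,n})'$ is then given by

(27) $n^{1/2}((\hat\theta'_{k,n},\hat\lambda'_{k,n}) - (\theta^{*\prime}_k,\lambda^{*\prime}_k))' \xrightarrow{d} N(0, H_k^{-1}S_kH_k^{-1})$ as $n \to \infty$ for fixed $k$,

where

(28) $S_k = E\begin{bmatrix} \tau_i^2 G_i'\lambda^*_k\lambda^{*\prime}_k G_i & \tau_i^2 G_i'\lambda^*_k g_i' \\ \tau_i^2 g_i\lambda^{*\prime}_k G_i & \tau_i^2 g_i g_i' \end{bmatrix}$,

(29) $H_k = E\begin{bmatrix} \dot\tau_i G_i'\lambda^*_k\lambda^{*\prime}_k G_i + \tau_i\,\partial(G_i'\lambda^*_k)/\partial\theta' & \dot\tau_i G_i'\lambda^*_k g_i' + \tau_i G_i' \\ \dot\tau_i g_i\lambda^{*\prime}_k G_i + \tau_i G_i & \dot\tau_i g_i g_i' \end{bmatrix}$,

and where $\tau_i = \tau(\lambda^{*\prime}_k g_i) = (1 - \lambda^{*\prime}_k g_i)^{-1}$, $\dot\tau_i = d\tau(\zeta)/d\zeta|_{\zeta=\lambda^{*\prime}_k g_i} = \tau_i^2$
and where all moments are evaluated at $\theta^*_k$ and $\lambda^*_k$
assuming that $x$ is drawn from $F_k(x)$ (i.e., $E[\cdot] \equiv E_{F_k}[\cdot]$).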

We focus on the upper left $N_\theta \times N_\theta$ submatrix of $H_k^{-1}S_kH_k^{-1}$, denoted
by $\Sigma_k$. For a given $k$, the submatrix provides the asymptotic variance of
$\hat\theta_{k,n}$. We will now analyze the behavior of $\Sigma_k$ as $k \to \infty$ (we are not claiming
that this provides the asymptotic variance of EL for infinite support at this
point). Since EL's implied probabilities must be positive (see, e.g., [4, 50]),
it follows that $(1 - \lambda^{*\prime}_k g(x,\theta^*_k))^{-1} > 0$ for all $x \in C_k$, or

(30) $\max_{x\in C_k}(\lambda^{*\prime}_k g(x,\theta^*_k)) < 1$.

Since $\{g(x,\theta^*_k) : x \in \mathcal{X}\}$ is unbounded in every direction, the set
$\{g(x,\theta^*_k) : x \in C_k\}$ becomes unbounded in every direction as $k \to \infty$. Hence, the only way to
satisfy equation (30) is to have $\lambda^*_k \to 0$ as $k \to \infty$. Since $\lambda^*_k \to 0$ as $k \to \infty$, the
expressions for $S_k$ and $H_k$ can be simplified by noting that when the product
$H_k^{-1}S_kH_k^{-1}$ is calculated, any term containing $\lambda^*_k$ will be dominated by terms
not containing $\lambda^*_k$. We then obtain [keeping the $\tau_i = \tau(\lambda^{*\prime}_k g_i)$ prefactors even
though $\lambda^*_k \to 0$ because the $g(x,\theta^*_k)$ are unbounded and it is not clear whether
we necessarily have $\tau_i \to 1$]

(31) $H_k^{-1} \to \begin{bmatrix} 0 & E[\tau_i G_i'] \\ E[\tau_i G_i] & E[\tau_i^2 g_i g_i'] \end{bmatrix}^{-1} \equiv \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$.

(Note that the sequence $F_k$ can be easily chosen so that the smallest
eigenvalue of $H_k$ remains bounded away from zero for all $k$ sufficiently large, since
the moment conditions remain the same over $k$ and $\mathcal{G}_k$ increases with $k$.
Hence, $\lim_{k\to\infty}H_k$ can be assumed nonsingular and interchanging the limit
as $k \to \infty$ and the matrix inversion operation is justified.) We then have
that

(32) $\Sigma_k = B_{12}E[\tau_i^2 g_i g_i']B_{21} + \rho_k$,

where $\rho_k$ is a remainder that vanishes as $k \to \infty$ (its precise form has no
bearing on the rest of the argument). By the partitioned inverse formula,

(33) $B_{21} = (E[\tau_i^2 g_i g_i'])^{-1}E[\tau_i G_i]\,(E[\tau_i G_i'](E[\tau_i^2 g_i g_i'])^{-1}E[\tau_i G_i])^{-1} = B_{12}'$.

Substituting this expression for $B_{21}$ into equation (32) yields

(34) $\Sigma_k = (E[\tau_i G_i'](E[\tau_i^2 g_i g_i'])^{-1}E[\tau_i G_i])^{-1} + \rho_k$.

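We will now show that $\Sigma_k$ diverges as $k \to \infty$. For EL, $\lambda^*_k$ is such that
$E[g(x_i,\theta^*_k)/(1 - \lambda^{*\prime}_k g(x_i,\theta^*_k))] = 0$. Since $E[g(x_i,\theta^*_k)/(1 - \lambda^{*\prime}_k g(x_i,\theta^*_k))] =
E[g(x_i,\theta^*_k)] + E[g(x_i,\theta^*_k)g'(x_i,\theta^*_k)/(1 - \lambda^{*\prime}_k g(x_i,\theta^*_k))]\lambda^*_k$, we have

(35) $\Omega_k\lambda^*_k = -E[g(x_i,\theta^*_k)]$,

where $\Omega_k = E[g(x_i,\theta^*_k)g'(x_i,\theta^*_k)/(1 - \lambda^{*\prime}_k g(x_i,\theta^*_k))]$. Since
$\inf_{k\ge\bar k}\|E[g(x_i,\theta^*_k)]\| > 0$ for some $\bar k \in \mathbb{N}$ by construction, having $\lambda^*_k \to 0$ as $k \to \infty$ is only
possible if at least one of the eigenvalues of $\Omega_k$ diverges as $k \to \infty$. Let $v$ be
a (unit) eigenvector associated with one of these eigenvalues. Then, by the
CSI, $v'\Omega_k v$ equals

(36) $E\left[\dfrac{v'g(x_i,\theta^*_k)}{1 - \lambda^{*\prime}_k g(x_i,\theta^*_k)}\,g'(x_i,\theta^*_k)v\right] \le \left(E\left[\dfrac{(v'g(x_i,\theta^*_k))^2}{(1 - \lambda^{*\prime}_k g(x_i,\theta^*_k))^2}\right]\right)^{1/2}\left(E[(v'g(x_i,\theta^*_k))^2]\right)^{1/2}$.

Since $E[(v'g(x_i,\theta^*_k))^2] \le \sup_{\theta\in\Theta}E[\|g(x_i,\theta)\|^2] < \infty$, (36) therefore implies
that $E[(v'g(x_i,\theta^*_k))^2/(1 - \lambda^{*\prime}_k g(x_i,\theta^*_k))^2] = E[\tau_i^2 v'g_i g_i'v]$ diverges and thus that $E[\tau_i^2 g_i g_i']$ has a
divergent eigenvalue. Since $E[\tau_i^2 g_i g_i']$ enters the expression of $\Sigma_k$ [given by
equation (34)], $\Sigma_k$ has at least one divergent eigenvalue as $k \to \infty$. Note that
the other terms entering the expression of $\Sigma_k$ cannot compensate for the
explosive behavior of $E[\tau_i^2 g_i g_i']$, since a simple application of the CSI shows
that, as $k \to \infty$,
$\|E[\tau_i G_i]\| = \|E[(1 + \tau_i\lambda^{*\prime}_k g_i)G_i]\| = O(E[\tau_i\|g_i\|\|\lambda^*_k\|\|G_i\|])
= O((E[\tau_i^2\|g_i\|^2])^{1/2}(E[\|G_i\|^2])^{1/2}\|\lambda^*_k\|) = o((E[\tau_i^2\|g_i\|^2])^{1/2})
= o((E[\tau_i^2\|g_i g_i'\|])^{1/2}) = o((E[\tau_i^2 v'g_i g_i'v])^{1/2})$.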

We will now show that the divergent behavior of $\Sigma_k$ implies that EL is
not root n consistent. We start by calculating the probability that $\hat\theta_{k,n}$ lies
outside of a root n neighborhood of the pseudo-true value $\theta^*_k$. Let $P_{k,n}$ be
the finite sample distribution of $n^{1/2}(u'\Sigma_k u)^{-1/2}u'(\hat\theta_{k,n} - \theta^*_k)$ for some
conformable unit vector $u$ such that $u'\Sigma_k u \to \infty$ as $k \to \infty$ ($u$ is an eigenvector
associated with one of the divergent eigenvalues of $\Sigma_k$). Let $P_{k,\infty}$ denote the
corresponding asymptotic distribution, the c.d.f. of a $N(0,1)$ for all $k$. For a
given $\xi < 0$, the probability that $u'(\hat\theta_{k,n} - \theta^*_k) \le n^{-1/2}\xi$ is $P_{k,n}((u'\Sigma_k u)^{-1/2}\xi)$.
Let $n_k = \min\{n : \sup_{m\ge n}|P_{k,m}((u'\Sigma_k u)^{-1/2}\xi) - P_{k,\infty}((u'\Sigma_k u)^{-1/2}\xi)| \le k^{-1}\}$.
This defines the sample size beyond which the difference [at $(u'\Sigma_k u)^{-1/2}\xi$]
between the finite sample and asymptotic distribution is less than $k^{-1}$. Such
a finite $n$ can always be found, since $P_{k,n}$ converges pointwise to $P_{k,\infty}$. Now
define the "inverse" sequence $k_n = \max\{k : n_k \le n\}$. Note that $k_n \to \infty$ as
$n \to \infty$, since $n_k \to \infty$ as $k \to \infty$.

Since $P[|u'(\hat\theta_{\infty,n} - \theta^*_\infty)| > \varepsilon] \ge P[|u'(\hat\theta_{k,n} - \theta^*_k)| > \varepsilon]$ for all $\varepsilon > 0$ and any $n$,
by the construction of $F_k$, $P[u'(\hat\theta_{\infty,n} - \theta^*_\infty) \le n^{-1/2}\xi] \ge P[u'(\hat\theta_{k,n} - \theta^*_k) \le n^{-1/2}\xi]
= P_{k,n}((u'\Sigma_k u)^{-1/2}\xi)$ for any $k$ and any $n$ and any $\xi < 0$. In particular,
for $k = k_n$,

(37) $P[u'(\hat\theta_{\infty,n} - \theta^*_\infty) \le n^{-1/2}\xi] \ge P_{k_n,n}((u'\Sigma_{k_n}u)^{-1/2}\xi)$

(38) $\qquad = P_{k_n,\infty}((u'\Sigma_{k_n}u)^{-1/2}\xi) + (P_{k_n,n}((u'\Sigma_{k_n}u)^{-1/2}\xi) - P_{k_n,\infty}((u'\Sigma_{k_n}u)^{-1/2}\xi))$

(39) $\qquad \ge P_{k_n,\infty}((u'\Sigma_{k_n}u)^{-1/2}\xi) - k_n^{-1}$

by the definition of $k_n$. As $n \to \infty$, $k_n^{-1} \to 0$. Since $P_{k,\infty}$ is the same for
all $k$ and is continuous [it is the c.d.f. of a $N(0,1)$], for any $k$ we have
$\lim_{n\to\infty}P_{k_n,\infty}((u'\Sigma_{k_n}u)^{-1/2}\xi) = \lim_{n\to\infty}P_{k,\infty}((u'\Sigma_{k_n}u)^{-1/2}\xi) =
P_{k,\infty}(\lim_{n\to\infty}(u'\Sigma_{k_n}u)^{-1/2}\xi) = P_{k,\infty}(0) = 1/2$, where we have used the fact that
$(u'\Sigma_k u)^{-1/2} \to 0$ since $u'\Sigma_k u$ diverges as $k \to \infty$. We then have
$\lim_{n\to\infty}P[u'(\hat\theta_{\infty,n} - \theta^*_\infty) \le n^{-1/2}\xi] \ge 1/2$ for any $\xi < 0$. A similar
reasoning for $\xi > 0$ implies that $\lim_{n\to\infty}P[u'(\hat\theta_{\infty,n} - \theta^*_\infty) \ge n^{-1/2}\xi] \ge 1/2$. It
follows that $\hat\theta_{\infty,n}$ lies outside a $n^{-1/2}$ neighborhood of $\theta^*_\infty$ with probability
approaching $1/2 + 1/2 = 1$ as $n \to \infty$, thus ruling out root n convergence.

To summarize, for any EL estimator $\hat\theta_{\infty,n}$ based on a distribution $F_\infty(x)$
with unbounded support, there exists a family of other estimators $\hat\theta_{k,n}$ based
on compactly supported distributions $F_k(x)$ all having a narrower
distribution than EL for each $n$. Yet the asymptotic variance of $\hat\theta_{k,n}$ diverges as
$k \to \infty$. By a standard diagonal argument, there exists an estimator
sequence $\hat\theta_{k_n,n}$ that is not root n consistent but whose distribution is narrower
than the one of EL at each $n$. Hence EL is not root n consistent. □

Proof of Theorem 2.

$\ln\hat L = n^{-1}\ln\prod_i n\hat w_i = n^{-1}\sum_i \ln\left(\dfrac{\exp(\hat\lambda' g_i)}{n^{-1}\sum_j\exp(\hat\lambda' g_j)}\right)$
$\quad = n^{-1}\sum_i\hat\lambda' g_i - \ln\Bigl(n^{-1}\sum_i\exp(\hat\lambda' g_i)\Bigr)$
$\quad = -\ln\Bigl(n^{-1}\sum_i\exp(\hat\lambda'(g_i - \hat g))\Bigr)$,

$\dfrac{d\ln\hat L}{d\theta'} = n^{-1}\sum_i\dfrac{d(\hat\lambda' g_i)}{d\theta'} - \Bigl(n^{-1}\sum_j\exp(\hat\lambda' g_j)\Bigr)^{-1}n^{-1}\sum_i\exp(\hat\lambda' g_i)\dfrac{d(\hat\lambda' g_i)}{d\theta'}$
$\quad = n^{-1}\sum_i(1 - n\hat w_i)\dfrac{d(\hat\lambda' g_i)}{d\theta'}$.

From (15), the first-order condition for $\hat\lambda$ is $\sum_i g_i\exp(\hat\lambda' g_i) = 0$. □

Proof of Theorem 3. Expanding the ETEL first-order conditions for
$\theta$ and $\lambda$ around $\theta = \theta^*$ and $\lambda = 0$ reveals an expansion identical to that of EL
at least up to $O_p(n^{-1/2})$ in an $O(n^{-1/2})$ neighborhood of $\theta = \theta^*$ and $\lambda = 0$,

$\begin{bmatrix} 0 \\ n^{-1}\sum_i g(x_i,\theta^*) \end{bmatrix} + n^{-1}\sum_i\begin{bmatrix} 0 & G'(x_i,\theta^*) \\ G(x_i,\theta^*) & g(x_i,\theta^*)g'(x_i,\theta^*) \end{bmatrix}\begin{bmatrix} \theta - \theta^* \\ \lambda \end{bmatrix} + o_p(n^{-1/2})$.

Calculational details can be found in [57]. In addition, in an $O(n^{-1/2})$
neighborhood of $\theta = \theta^*$, both the ETEL and the EL objective functions for $\theta$
share the same expansion in $(\theta - \theta^*)$ at least up to $O_p(n^{-1})$,

$-\tfrac{1}{2}(\theta - \theta^*)'\Bigl(n^{-1}\sum_i G'(x_i,\theta^*)\Bigr)(\hat\Omega^*)^{-1}\Bigl(n^{-1}\sum_i G(x_i,\theta^*)\Bigr)(\theta - \theta^*) + o_p(n^{-1})$,

where $\hat\Omega^* = n^{-1}\sum_i g(x_i,\theta^*)g'(x_i,\theta^*)$. It is known (see, e.g., [47, 49]) that the
EL estimator is asymptotically such that the solutions $\hat\lambda_{\rm EL}$ and $\hat\theta_{\rm EL}$ lie within
the $O(n^{-1/2})$ neighborhood where the remainder terms of these expansions
are negligible. Hence, asymptotically, $\hat\lambda_{\rm EL}$ and $\hat\theta_{\rm EL}$ also solve the ETEL
first-order conditions, apart from negligible remainders. Since $\hat\lambda_{\rm EL} \xrightarrow{p} 0$, and
since both the EL and the ETEL objective functions for $\theta$ converge to their
maximum possible value when $\hat\lambda_{\rm EL} \xrightarrow{p} 0$ and $\hat\lambda_{\rm ETEL} \xrightarrow{p} 0$, respectively, the
existence of another solution outside of the neighborhood of validity of the
above expansions can be ruled out. ETEL thus inherits all the first-order
properties of EL established in [47, 49]. □



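Proof of Theorem 4. The first conclusion follows from the fact that
the implied probabilities are given by

\[
\hat w_i(\theta) = \exp(\hat\lambda(\theta)'g(x_i,\theta)) \Big/ \Big(\sum_j \exp(\hat\lambda(\theta)'g(x_j,\theta))\Big),
\]

a necessarily positive quantity for any $\lambda$ and $\theta$. The second conclusion holds
for any estimator where $\hat\theta$ is the extremum of a differentiable objective function:

\[
\frac{\partial \ln L(T(\beta))}{\partial\beta} = \frac{\partial T(\beta)'}{\partial\beta}\frac{\partial \ln L(\theta)}{\partial\theta} = 0
\]

if and only if $\partial \ln L(\theta)/\partial\theta = 0$ since $\partial T(\beta)'/\partial\beta$ has full rank [$T(\beta)$ being
one-to-one]. The third conclusion can be shown by noting that any invertible
linear transformation of the moment function $A(\theta)g(x_i,\theta)$ simply causes the
Lagrange multiplier $\hat\lambda(\theta)$ to become $((A(\theta))^{-1})'\hat\lambda(\theta)$. Indeed, under these
two transformations, the first-order conditions for both $\theta$ and $\lambda(\theta)$ remain
satisfied,

\[
n^{-1}\sum_i (1 - n\hat w_i(\theta))\,\partial(\lambda'(A(\theta))^{-1}A(\theta)g_i)/\partial\theta
= n^{-1}\sum_i (1 - n\hat w_i(\theta))\,\partial(\lambda' g_i)/\partial\theta = 0,
\]

where $\hat w_i(\theta) = \exp(\lambda'(A(\theta))^{-1}A(\theta)g_i)/(\sum_j \exp(\lambda'(A(\theta))^{-1}A(\theta)g_j)) = \exp(\lambda' g_i)/(\sum_j \exp(\lambda' g_j))$
and $n^{-1}\sum_i \exp(\lambda'(A(\theta))^{-1}A(\theta)g_i)A(\theta)g_i = A(\theta)n^{-1}\sum_i \exp(\lambda' g_i)g_i = 0$
if and only if $n^{-1}\sum_i \exp(\lambda' g_i)g_i = 0$ since $A(\theta)$
is invertible. □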
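
The first and third conclusions can be illustrated in the simplest possible setting. The sketch below is only an illustration under stated assumptions: a scalar moment function g(x, θ) = x − θ, a made-up sample, and a scalar "transformation" A = −2.5 standing in for the invertible matrix A(θ). The helper names are hypothetical.

```python
import math

def et_lambda(g, iters=50):
    """Newton iterations for the scalar root of sum_i exp(lam*g_i)*g_i = 0."""
    lam = 0.0
    for _ in range(iters):
        e = [math.exp(lam * gi) for gi in g]
        f = sum(ei * gi for ei, gi in zip(e, g))
        df = sum(ei * gi * gi for ei, gi in zip(e, g))
        lam -= f / df
    return lam

def et_weights(g):
    """Return the multiplier and the implied probabilities for moments g."""
    lam = et_lambda(g)
    e = [math.exp(lam * gi) for gi in g]
    total = sum(e)
    return lam, [ei / total for ei in e]

x = [-1.0, -0.5, 0.0, 0.3, 0.8, 1.5, 2.0]   # made-up sample
theta = 0.2
g = [xi - theta for xi in x]                # assumed moment function
A = -2.5                                    # assumed invertible (scalar) transformation

lam, w = et_weights(g)
lam_A, w_A = et_weights([A * gi for gi in g])

# The multiplier rescales to lam / A while the implied probabilities are unchanged.
max_diff = max(abs(wi - wAi) for wi, wAi in zip(w, w_A))
print(lam, lam_A, lam / A, max_diff)
```

In this scalar case $((A)^{-1})'\lambda$ reduces to λ/A, which is what the comparison of `lam_A` with `lam / A` checks; the weights are positive by construction because they are normalized exponentials.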



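Proof of Theorem 5. By Theorem 2, the first order condition for $\theta$ is

\[
\begin{aligned}
\frac{\partial \ln \hat L}{\partial\theta'} &= n^{-1}\sum_i g_i'\frac{\partial\lambda}{\partial\theta'} + n^{-1}\sum_i \lambda' G_i - \Big(\sum_i \hat w_i g_i'\Big)\frac{\partial\lambda}{\partial\theta'} - \lambda'\sum_i \hat w_i G_i \\
&= \bar g'\frac{\partial\lambda}{\partial\theta'} + \lambda'\hat G - \lambda'\tilde G,
\end{aligned}
\tag{40}
\]

where $\lambda = \hat\lambda(\theta)$, $\bar g = n^{-1}\sum_i g_i$, $\hat G = n^{-1}\sum_i G_i$ and $\tilde G = \sum_i \hat w_i G_i$, and where the
third term vanishes because $\sum_i \hat w_i g_i = 0$.
To find $\partial\lambda/\partial\theta'$, we note that the total differential of $\sum_i \exp(\lambda' g_i)g_i = 0$ yields

\[
\sum_i \exp(\lambda' g_i)g_i g_i'\,d\lambda + \sum_i \exp(\lambda' g_i)G_i\,d\theta + \sum_i g_i\exp(\lambda' g_i)\lambda' G_i\,d\theta = 0,
\]

\[
\sum_i g_i g_i'\hat w_i\,d\lambda + \sum_i \hat w_i(I + g_i\lambda')G_i\,d\theta = 0,
\]

implying that

\[
\frac{\partial\lambda}{\partial\theta'} = -\tilde\Omega^{-1}\Big(\tilde G + \sum_i \hat w_i g_i\lambda' G_i\Big), \qquad \tilde\Omega = \sum_i \hat w_i g_i g_i'.
\tag{41}
\]

Substituting this result into equation (40) gives

\[
\begin{aligned}
\frac{\partial \ln \hat L}{\partial\theta'} &= -\bar g'\tilde\Omega^{-1}\Big(\tilde G + \sum_i \hat w_i g_i\lambda' G_i\Big) + \lambda'\hat G - \lambda'\tilde G \\
&= -\bar g'\tilde\Omega^{-1}\tilde G - \bar g'\tilde\Omega^{-1}\sum_i \hat w_i g_i\lambda' G_i + \lambda'\hat G - \lambda'\tilde G \\
&= -\bar g'\tilde\Omega^{-1}\tilde G - \bar g'\tilde\Omega^{-1}\sum_i \hat w_i g_i\lambda' G_i + n^{-1}\sum_i (1 - n\hat w_i)\lambda' G_i.
\end{aligned}
\tag{42}
\]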

i i 

By the first-order equivalence between EL and ETEL established in Theorem
3 and using Theorem 3.1 in [47], $\lambda = \hat\lambda(\theta) = O_p(n^{-1/2})$ and $\bar g = O_p(n^{-1/2})$ for
$\theta$ such that $\|\theta - \theta^*\| = O_p(n^{-1/2})$. These facts, along with the fact that
$\sup_{\theta\in\Theta}\max_{i\le n}\|g_i\| = o_p(n^{1/2})$ by part 4 of Assumption 1, provide us with
asymptotic expansions for $n\hat w_i$ and $\lambda$,

\[
\begin{aligned}
n\hat w_i &= \frac{\exp(\lambda' g_i)}{n^{-1}\sum_j \exp(\lambda' g_j)} \\
&= \frac{1 + \lambda' g_i + O((\lambda' g_i)^2)}{1 + \lambda'\bar g + O_p(n^{-1})} \\
&= \frac{1 + \lambda' g_i + O_p(n^{-1})\|g_i\|^2}{1 + O_p(n^{-1}) + O_p(n^{-1})} \\
&= 1 + \lambda' g_i + O_p(n^{-1})\|g_i\|^2.
\end{aligned}
\tag{43}
\]



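An expansion for $\lambda$ is obtained by noting that the left-hand side of
$n^{-1}\sum_i g_i\exp(g_i'\lambda) = 0$ can be written as

\[
\begin{aligned}
n^{-1}\sum_i g_i(1 + g_i'\lambda) + R_0 &= n^{-1}\sum_i g_i + \Big(n^{-1}\sum_i g_i g_i'\Big)\lambda + R_0 \\
&= \bar g + \tilde\Omega\lambda + R_0 + R_1,
\end{aligned}
\]

implying that

\[
\lambda = -\tilde\Omega^{-1}\bar g - \tilde\Omega^{-1}(R_0 + R_1),
\tag{44}
\]

where $\tilde\Omega = \sum_i \hat w_i g_i g_i'$ and the remainder terms $R_0, R_1$ can be bounded using the assumption
$E[\sup_{\theta\in\mathcal N}\|g_i\|^4] < \infty$ and (43): $\|R_0\| = O_p(n^{-1})\,n^{-1}\sum_i \|g_i\|^3 = O_p(n^{-1})$ and
$\|R_1\| \le n^{-1}\sum_i |n\hat w_i - 1|\,\|g_i\|^2\|\lambda\| = n^{-1}\sum_i O(\|\lambda\|\|g_i\|)\|g_i\|^2\|\lambda\|
= \|\lambda\|\,O(\|\lambda\|)\,n^{-1}\sum_i \|g_i\|^3 = O_p(n^{-1})$.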
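
Expansions (43) and (44) can be illustrated numerically in a scalar example. The sketch below is not from the paper: it uses a made-up deterministic grid of moment values with a small nonzero mean (mimicking $\bar g = O_p(n^{-1/2})$), the unweighted second moment $n^{-1}\sum_i g_i^2$ in place of $\tilde\Omega$ (the two differ only through the remainder $R_1$), and hypothetical helper names.

```python
import math

def et_lambda(g, iters=50):
    """Newton iterations for the scalar root of sum_i exp(lam*g_i)*g_i = 0."""
    lam = 0.0
    for _ in range(iters):
        e = [math.exp(lam * gi) for gi in g]
        f = sum(ei * gi for ei, gi in zip(e, g))
        df = sum(ei * gi * gi for ei, gi in zip(e, g))
        lam -= f / df
    return lam

n = 200
# Assumed moment values: an even grid on about [-0.48, 0.52] with mean 0.02.
g = [(i + 0.5) / n - 0.5 + 0.02 for i in range(n)]

lam = et_lambda(g)                       # exact ET multiplier
gbar = sum(g) / n
omega = sum(gi * gi for gi in g) / n     # unweighted second moment
lam_first_order = -gbar / omega          # leading term of (44)

# Leading term of (43): n * w_i is close to 1 + lam * g_i.
e = [math.exp(lam * gi) for gi in g]
mean_e = sum(e) / n
nw = [ei / mean_e for ei in e]
max_weight_err = max(abs(nwi - (1.0 + lam * gi)) for nwi, gi in zip(nw, g))

print(lam, lam_first_order, max_weight_err)
```

In this example the gap between `lam` and `lam_first_order` is of the order of $\lambda^2$, and `max_weight_err` is of the order of $\lambda^2\max_i g_i^2$, in line with the remainder bounds above.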

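Substituting the expansion (43) into the last term of (42) yields

\[
\frac{\partial \ln \hat L(\theta)}{\partial\theta'} = -\bar g'\tilde\Omega^{-1}\tilde G - \bar g'\tilde\Omega^{-1}\sum_i \hat w_i g_i\lambda' G_i - n^{-1}\sum_i \lambda' g_i\,\lambda' G_i + R_2,
\tag{45}
\]

where $\|R_2\| \le O_p(n^{-1})\,n^{-1}\sum_i \|g_i\|^2\|\lambda\|\|G_i\| \le O_p(n^{-3/2})\,n^{-1}\sum_i \|g_i\|^2\|G_i\| \le
O_p(n^{-3/2})(n^{-1}\sum_i \|g_i\|^4)^{1/2}(n^{-1}\sum_i \|G_i\|^2)^{1/2} = O_p(n^{-3/2})$, after using the
Cauchy-Schwarz inequality (CSI) and the facts that $E[\sup_{\theta\in\mathcal N}\|g_i\|^4] < \infty$ and $E[\sup_{\theta\in\mathcal N}\|G_i\|^2] < \infty$.
Then (45) becomes

\[
\frac{\partial \ln \hat L(\theta)}{\partial\theta'} = -\bar g'\tilde\Omega^{-1}\tilde G - \bar g'\tilde\Omega^{-1}\sum_j \hat w_j g_j\lambda' G_j - (\lambda)'\,n^{-1}\sum_j g_j\lambda' G_j + O_p(n^{-3/2}),
\]

where the $\lambda$ in parentheses can be replaced by expansion (44),

\[
\frac{\partial \ln \hat L(\theta)}{\partial\theta'} = -\bar g'\tilde\Omega^{-1}\tilde G - \bar g'\tilde\Omega^{-1}\sum_j \hat w_j g_j\lambda' G_j + \bar g'\tilde\Omega^{-1}n^{-1}\sum_j g_j\lambda' G_j + R_3 + O_p(n^{-3/2}),
\tag{46}
\]

where $\|R_3\| = O_p(n^{-1})\,n^{-1}\sum_j \|g_j\|\|\lambda\|\|G_j\| = O_p(n^{-3/2})\,n^{-1}\sum_j \|g_j\|\|G_j\| \le
O_p(n^{-3/2})(n^{-1}\sum_j \|g_j\|^2)^{1/2}(n^{-1}\sum_j \|G_j\|^2)^{1/2} = O_p(n^{-3/2})$ by the CSI,
$E[\sup_{\theta\in\mathcal N}\|g_i\|^4] < \infty$ and $E[\sup_{\theta\in\mathcal N}\|G_i\|^2] < \infty$. Then (46) becomes

\[
\begin{aligned}
\frac{\partial \ln \hat L(\theta)}{\partial\theta'} &= -\bar g'\tilde\Omega^{-1}\tilde G + \bar g'\tilde\Omega^{-1}n^{-1}\sum_j (1 - n\hat w_j)g_j\lambda' G_j + O_p(n^{-3/2}) \\
&= -\bar g'\tilde\Omega^{-1}\tilde G - \bar g'\tilde\Omega^{-1}n^{-1}\sum_j (\lambda' g_j)g_j\lambda' G_j + R_4 + O_p(n^{-3/2}),
\end{aligned}
\tag{47}
\]

where we have used the expansion (43) again and where $\|R_4\| \le \|\bar g'\tilde\Omega^{-1}\|\,n^{-1}\sum_j O((\lambda' g_j)^2)\|g_j\|\|\lambda\|\|G_j\|
= \|\bar g\|\|\lambda\|^3\|\tilde\Omega^{-1}\|\,n^{-1}\sum_j \|g_j\|^2\|g_j\|\|G_j\| \le
O_p(n^{-2})(\max_{j\le n}\|g_j\|)\,n^{-1}\sum_j \|g_j\|^2\|G_j\| = O_p(n^{-2})O_p(n^{1/2})\,n^{-1}\sum_j \|g_j\|^2\|G_j\|
= O_p(n^{-3/2})$ by the CSI, the assumptions that $E[\sup_{\theta\in\mathcal N}\|g_i\|^4] < \infty$
and $E[\sup_{\theta\in\mathcal N}\|G_i\|^2] < \infty$ and the fact that $E[\|g_i\|^2] < \infty \Rightarrow \max_{i\le n}\|g_i\| =
O_p(n^{1/2})$ (as in [47], Lemma A1). The second term of (47) is itself $O_p(n^{-1/2})O_p(n^{-1}) = O_p(n^{-3/2})$, so that finally (47) becomes

\[
\frac{\partial \ln \hat L(\theta)}{\partial\theta'} = -\bar g'\tilde\Omega^{-1}\tilde G + O_p(n^{-3/2}).
\]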

Now, the term $\bar g'\tilde\Omega^{-1}\tilde G$ is similar to the first-order conditions for EL, except
that the weights used in $\tilde\Omega$ and $\tilde G$ are the ET rather than the EL weights.
However, by (43) and a similar expansion for the EL weights, $n(\hat w_{i,ET} -
\hat w_{i,EL}) = O_p(n^{-1})\|g_i\|^2$. This fact, along with $\bar g = O_p(n^{-1/2})$, implies that

\[
\begin{aligned}
\bar g'\tilde\Omega^{-1}\tilde G &= \bar g'\Big(n^{-1}\sum_i n\hat w_{i,EL}\,g_i g_i' + R_5\Big)^{-1}\Big(n^{-1}\sum_i n\hat w_{i,EL}\,G_i + R_6\Big) \\
&= \bar g'\Big(\sum_i \hat w_{i,EL}\,g_i g_i'\Big)^{-1}\Big(\sum_i \hat w_{i,EL}\,G_i\Big) + O_p(n^{-1/2})O_p(n^{-1})
\end{aligned}
\]

by the differentiability of the inverse and the fact that $\|R_5\| \le n^{-1}\sum_i O_p(n^{-1})
\|g_i\|^2\|g_i\|^2 = O_p(n^{-1})\,n^{-1}\sum_i \|g_i\|^4 = O_p(n^{-1})$ and $\|R_6\| \le n^{-1}
\sum_i O_p(n^{-1})\|g_i\|^2\|G_i\| = O_p(n^{-1})$.

This implies that the first-order condition for ETEL is the same as that
of EL up to $O_p(n^{-3/2})$. The continuous differentiability of $g$ in $\theta$ implies
$\|\hat\theta_{ETEL} - \hat\theta_{EL}\| = O_p(n^{-3/2})$ by a standard expansion of the first-order
condition around $\theta = \theta^*$. □

Proof of Theorem 7. Lemma A4 in [47] establishes that under regularity
conditions implied by the ones given in the statement of the present
theorem, a just-identified GMM estimator $\hat\beta$ defined by $n^{-1}\sum_i \phi(x_i,\hat\beta) = 0$
admits a stochastic expansion of the form

\[
\hat\beta_l - \beta^*_l = n^{-1/2}\psi_l + n^{-1}Q_l + n^{-3/2}R_l + O_p(n^{-2}),
\tag{48}
\]
where



Qi = E + \ E *ijfc*i**. 

3 j,k 

**=E $ r ? % ^.^E^ 1 ^ *yfc=E $ r ? %,^> 

9 9 9 

^=E $ r>9> ^j=E $ r g 1$ 9 j. *«*=E*i« 1 *w*' 

9 9 9 

^,i^ = E $ r g 1$ 9jfc^ 

9 

~d(/)(xi,/3) 



d(f)i(xi,(3) 



$l,jkh 



13=13* 



d/3j dp k 



(3=/3* 









P=I3*- 



^= re ~ 1/2 E^(^,r 



aft 



<i>, 



/3=/3* 



Zj'fc — n 



d(3j d(3 k 



l,jk 



(We have adapted Newey and Smith's result to follow our notation and
slightly simplified it using the fact that $\Psi_{ljk} = \Psi_{lkj}$.) We now write the
ETEL and EL estimators as just-identified GMM estimators that can be
easily compared. As shown in Lemma 9, and as discussed in Section 3.2.3
in the text, the ETEL estimator can be written as a subvector of an augmented
parameter vector $\beta = (\tau, \kappa', \lambda', \theta')'$ that solves a just-identified vector
of moment conditions $n^{-1}\sum_i \phi^{ETEL}(x_i, \hat\beta^{ETEL}) = 0$, where $\phi^{ETEL}(x_i, \beta)$
is given by (23).
It is well known that EL can also be written as a subvector $\theta$ of an
augmented parameter vector $(\kappa', \theta')'$ that solves a just-identified vector of
moment conditions

\[
n^{-1}\sum_i \begin{bmatrix} \varepsilon_i g_i \\ \varepsilon_i G_i'\kappa \end{bmatrix} = 0,
\tag{49}
\]

where $\varepsilon_i = (1 - \kappa' g_i)^{-1}$ and $\kappa$ is the Lagrange multiplier of the moment
constraints, which has been relabelled $\kappa$ to simplify the comparison with
ETEL. Once again, to further simplify the comparison, we augment the
vector in (49) by $1 + \dim\kappa$ additional moment conditions and introduce the
same number of additional parameters $(\tau, \lambda)$, where $\tau \in \mathbb{R}$ and $\lambda \in \mathbb{R}^{\dim\kappa}$,

\[
n^{-1}\sum_i [(\tau_i - \tau),\ (\tau_i g_i)',\ (\varepsilon_i g_i)',\ (\varepsilon_i G_i'\kappa)']' = 0,
\]

where $\tau_i = \exp(\lambda' g_i)$. In this fashion, the dimension of the vector of moment
conditions and the number of parameters are the same in ETEL as in EL.
The additional moment conditions merely define the values of the new parameters
$(\tau, \lambda)$ and do not change the values of $(\kappa', \theta')'$. Indeed, whenever
$(\kappa', \theta')'$ are such that the bottom two subvectors are zero, one can always
find a value of $(\tau, \lambda)$ that will make the top two subvectors vanish as well.
(There exists $\lambda$ such that $n^{-1}\sum_i \tau_i g_i = 0$ w.p.a. 1. Then, we can just set
$\tau = n^{-1}\sum_i \tau_i$.)

Finally, since just-identified GMM is invariant under linear transformations
of the vector of moment conditions, the moment conditions for EL can
equivalently be written as $n^{-1}\sum_i \phi^{EL}(x_i, \hat\beta^{EL}) = 0$, where

fi-T 
Ti9i 

Equipped with (23) and (50), we can construct a stochastic expansion of the
form (48) for each estimator. The $O(n^{-2})$ covariance between two elements
of the parameter vector, $\theta_l$ and $\theta_m$, is given by

\[
W_{lm} = \mathrm{Covar}[Q_{l_\theta+l}, Q_{l_\theta+m}] + \mathrm{Covar}[R_{l_\theta+l}, \psi_{l_\theta+m}] + \mathrm{Covar}[\psi_{l_\theta+l}, R_{l_\theta+m}],
\tag{51}
\]

where $l_\theta = 1 + 2\dim\lambda$. The quantities associated with each estimator will
be distinguished by an "ETEL" or "EL" superscript.

We provide below the sequence of equalities that need to be established
in order to show, as directly as possible, that ETEL and EL have the same
$O(n^{-2})$ variance. The tedious yet straightforward calculational details that
prove each statement are omitted below but can be found in [57].



EiG'jK 



(50) 



X 
(1) <J)ETEL = |,EL and $ ETEL = $EL _ ^ . ^ ^ETEL = ^EL

(2) W^^-^^EiC^-^ti+iE^^-^gi 

'(2a) (* E J EL - *gO^ = E 9 j - *g)*i, where E# E J EL - 

(2b) (* E / EL - ^ k )^% = E $T q H^T - where 

Ei,*(*5r-^*)*i*fc = o. 

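(3) (2), (2a) and (2b) $\Rightarrow$ $Q^{ETEL} - Q^{EL} = 0$.

(4) (1) and (3) $\Rightarrow$ $W^{ETEL}_{lm} - W^{EL}_{lm} = \mathrm{Covar}[R^{ETEL}_{l_\theta+l} - R^{EL}_{l_\theta+l}, \psi_{l_\theta+m}] +
\mathrm{Covar}[\psi_{l_\theta+l}, R^{ETEL}_{l_\theta+m} - R^{EL}_{l_\theta+m}]$.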

(5) (1) and (3) =► Rf TEL - Rf L = Ej(*fj EL ~ *Tj)Qj + 



iy"-, 1 .f^ E w EL -^ EL , ^^^i.^ 

6 ^-ij,k,h\ l,jkh *l,jkh)^3 w k* 



/c 



(5a) Ei(*g?g " *F e W<2,j = \T, j H lj -gfg'Pg, where 5 = ^V^.^, 
= (G'n^G^G'n- 1 and P = ft" 1 - U~ 1 G(G'Q~ 1 G)~ 1 G'Q~ 1 . 
(5b) pPj^ - = -lEjHygjg'Pg + with 

E[E l ^ lg+m ] = o(n- 2 ). 

(5c) ^(^ E 7 EL -^ EE )^ fc = 

(5d) i?[(*Eg tt - ^ +l)jkh )^%^ h ^ +m ] = o(n" 2 ). 

(6) (5a) through (5d) $\Rightarrow$ $\mathrm{Covar}[R^{ETEL}_{l_\theta+l} - R^{EL}_{l_\theta+l}, \psi_{l_\theta+m}] = o(n^{-2})$.

(7) (3) and (6) $\Rightarrow$ ETEL and EL share the same $O(n^{-2})$ variance. □

Proof of Theorem 8. Let $g_{i,a}$ and $g_{i,b}$ denote the subvectors of $g_i$ that
are mutually independent and let $\lambda_a$ and $\lambda_b$ denote the corresponding subvectors
of the Lagrange multiplier. Independence holds if and only if for any
measurable functions $a(g_{i,a})$ and $b(g_{i,b})$, $E[a(g_{i,a})b(g_{i,b})] = E[a(g_{i,a})]E[b(g_{i,b})]$
whenever these expectations are defined. The exponentially tilted empirical
distribution estimates the moment $E[a(g_{i,a})]$ by

\[
\begin{aligned}
\hat Q_a &= \Big(n^{-1}\sum_j \exp(\hat\lambda' g_j)\Big)^{-1} n^{-1}\sum_i a(g_{i,a})\exp(\hat\lambda' g_i) \\
&\to_p (E[\exp(\lambda' g_i)])^{-1}E[a(g_{i,a})\exp(\lambda' g_i)] \\
&= (E[\exp(\lambda_a' g_{i,a})\exp(\lambda_b' g_{i,b})])^{-1}E[a(g_{i,a})\exp(\lambda_a' g_{i,a})\exp(\lambda_b' g_{i,b})] \\
&= \frac{E[a(g_{i,a})\exp(\lambda_a' g_{i,a})]\,E[\exp(\lambda_b' g_{i,b})]}{E[\exp(\lambda_a' g_{i,a})]\,E[\exp(\lambda_b' g_{i,b})]} \\
&= \frac{E[a(g_{i,a})\exp(\lambda_a' g_{i,a})]}{E[\exp(\lambda_a' g_{i,a})]} \\
&\equiv Q_a
\end{aligned}
\]

and similarly for $E[b(g_{i,b})]$. The exponentially tilted empirical distribution
estimates the moment $E[a(g_{i,a})b(g_{i,b})]$ by

\[
\begin{aligned}
\hat Q_{ab} &= \Big(n^{-1}\sum_j \exp(\hat\lambda' g_j)\Big)^{-1} n^{-1}\sum_i a(g_{i,a})b(g_{i,b})\exp(\hat\lambda' g_i) \\
&\to_p (E[\exp(\lambda_a' g_{i,a})\exp(\lambda_b' g_{i,b})])^{-1}E[a(g_{i,a})\exp(\lambda_a' g_{i,a})\,b(g_{i,b})\exp(\lambda_b' g_{i,b})] \\
&= \frac{E[a(g_{i,a})\exp(\lambda_a' g_{i,a})]}{E[\exp(\lambda_a' g_{i,a})]}\,\frac{E[b(g_{i,b})\exp(\lambda_b' g_{i,b})]}{E[\exp(\lambda_b' g_{i,b})]} = Q_a Q_b
\end{aligned}
\]

by the independence of $g_{i,a}$ and $g_{i,b}$ under the true untilted distribution.
Hence $\operatorname{plim}\hat Q_{ab} = \operatorname{plim}\hat Q_a \operatorname{plim}\hat Q_b$ as claimed. A similar result does not hold
for EL because $(1 - \lambda' g_i)^{-1} \ne (1 - \lambda_a' g_{i,a})^{-1}(1 - \lambda_b' g_{i,b})^{-1}$, unless $\max_{i\le n}|\lambda' g_i| \to_p
0$, which is impossible under global misspecification. □
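
The factorization argument can be illustrated on a finite sample in which the two moment components are exactly independent, namely a product grid. The sketch below rests on several assumptions that are not in the paper: the grid values are made up, a(·) and b(·) are taken to be identity functions, and the EL-type multipliers `kap_a`, `kap_b` are arbitrary illustrative values (not the EL solution), chosen only to keep the weights positive.

```python
import math

def et_lambda(g, iters=50):
    """Newton iterations for the scalar root of sum_i exp(lam*g_i)*g_i = 0."""
    lam = 0.0
    for _ in range(iters):
        e = [math.exp(lam * gi) for gi in g]
        f = sum(ei * gi for ei, gi in zip(e, g))
        df = sum(ei * gi * gi for ei, gi in zip(e, g))
        lam -= f / df
    return lam

# Product grid: under uniform weights the two components are independent,
# and both have nonzero means (the moment conditions are misspecified).
a_vals = [-1.0, 0.5, 2.0]
b_vals = [-1.0, 0.0, 3.0]
pairs = [(a, b) for a in a_vals for b in b_vals]

def dependence_gap(weights):
    """Q_ab - Q_a * Q_b under the normalized weights, with a(.) and b(.) identity."""
    total = sum(weights)
    w = [wi / total for wi in weights]
    q_a = sum(wi * a for wi, (a, b) in zip(w, pairs))
    q_b = sum(wi * b for wi, (a, b) in zip(w, pairs))
    q_ab = sum(wi * a * b for wi, (a, b) in zip(w, pairs))
    return q_ab - q_a * q_b

# On a product grid the joint ET first-order conditions separate, so each
# multiplier can be obtained from its own one-dimensional problem.
lam_a = et_lambda(a_vals)
lam_b = et_lambda(b_vals)
et_gap = dependence_gap([math.exp(lam_a * a + lam_b * b) for a, b in pairs])

# EL-type weights 1 / (1 - kappa'g) with illustrative (assumed) multipliers.
kap_a, kap_b = -0.2, -0.1
el_gap = dependence_gap([1.0 / (1.0 - kap_a * a - kap_b * b) for a, b in pairs])

print(et_gap, el_gap)
```

The exponential weights factor into a function of a times a function of b, so `et_gap` is zero up to rounding, whereas the reciprocal-linear weights do not factor and `el_gap` is clearly nonzero.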

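Proof of Lemma 9. From (42), the first-order condition for $\theta$ is

\[
-\tilde G'\tilde\Omega^{-1}\bar g - \sum_i \hat w_i G_i'\lambda g_i'\tilde\Omega^{-1}\bar g + n^{-1}\sum_i G_i'\lambda - \sum_i \hat w_i G_i'\lambda = 0
\tag{52}
\]

(after transposition), where $\tilde G = \sum_i \hat w_i G_i$, $\tilde\Omega = \sum_i \hat w_i g_i g_i'$ and $\lambda$ satisfies

\[
\sum_i \exp(\lambda' g_i)g_i = 0.
\tag{53}
\]

Equation (52) contains products of sample moments which are difficult to
analyze. Our goal is thus to define auxiliary parameters that will allow us
to rewrite the first-order conditions as a linear function of sample moments.
Let us introduce the quantity $\tau_i = \exp(\lambda' g_i)$ and

\[
\tau = n^{-1}\sum_i \tau_i.
\tag{54}
\]

Noting that $\hat w_i = n^{-1}\tau_i/\tau$, (52) becomes

\[
\begin{aligned}
&-n^{-1}\sum_i \frac{\tau_i}{\tau}G_i'\Big(n^{-1}\sum_j \frac{\tau_j}{\tau}g_j g_j'\Big)^{-1}\bar g
- n^{-1}\sum_i \frac{\tau_i}{\tau}G_i'\lambda g_i'\Big(n^{-1}\sum_j \frac{\tau_j}{\tau}g_j g_j'\Big)^{-1}\bar g \\
&\qquad + n^{-1}\sum_i G_i'\lambda - \tau^{-1}n^{-1}\sum_i \tau_i G_i'\lambda = 0.
\end{aligned}
\tag{55}
\]

Now, we introduce $\kappa = -(n^{-1}\sum_i (\tau_i/\tau)g_i g_i')^{-1}\bar g$, or equivalently,

\[
n^{-1}\sum_i (\tau g_i + \tau_i g_i g_i'\kappa) = 0.
\tag{56}
\]

Substituting $\kappa$ wherever it appears in (55), after multiplying through
by $\tau$, yields

\[
n^{-1}\sum_i \tau_i G_i'\kappa + n^{-1}\sum_i \tau_i G_i'\lambda g_i'\kappa + n^{-1}\sum_i \tau G_i'\lambda - n^{-1}\sum_i \tau_i G_i'\lambda = 0.
\tag{57}
\]

Equation (57) is now linear in the sample moments. Equations (54), (53),
(56) and (57) can be collected into a single vector of moment conditions
$n^{-1}\sum_i \phi(x_i, \hat\beta) = 0$, where $\beta = (\tau, \kappa', \lambda', \theta')'$ and

\[
\phi(x_i, \beta) = \begin{bmatrix}
\tau_i - \tau \\
\tau_i g_i \\
(\tau - \tau_i)g_i + \tau_i g_i g_i'\kappa \\
\tau_i G_i'\kappa + \tau_i G_i'\lambda g_i'\kappa - \tau_i G_i'\lambda + \tau G_i'\lambda
\end{bmatrix}.
\tag{58}
\]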
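[For convenience, the third block is obtained by subtracting (53) from (56).]
Noting that $\partial\tau_i/\partial\lambda = \tau_i g_i$, $\partial^2\tau_i/\partial\lambda\,\partial\lambda' = \tau_i g_i g_i'$ and $\partial\tau_i/\partial\theta = \tau_i G_i'\lambda$, the first expression for
$\phi(x_i, \beta)$ in (23) also follows. □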
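
The role of the auxiliary parameters can be checked in a scalar example. The following sketch is an illustration under stated assumptions, not the paper's computation: g(x, θ) = x − θ (so that G = −1), a made-up sample, a fixed trial value of θ, and hypothetical helper names. It confirms that the first three blocks of the moment vector average to zero once λ, τ and κ are defined as above, and that the average of the fourth block divided by τ matches a finite-difference derivative of ln L(θ).

```python
import math

def et_lambda(g, iters=50):
    """Newton iterations for the scalar root of sum_i exp(lam*g_i)*g_i = 0."""
    lam = 0.0
    for _ in range(iters):
        e = [math.exp(lam * gi) for gi in g]
        f = sum(ei * gi for ei, gi in zip(e, g))
        df = sum(ei * gi * gi for ei, gi in zip(e, g))
        lam -= f / df
    return lam

def log_etel(x, theta):
    """ln L(theta) = -ln(n^{-1} sum_i exp(lam * (g_i - gbar))), as in Theorem 2."""
    g = [xi - theta for xi in x]
    lam = et_lambda(g)
    gbar = sum(g) / len(g)
    return -math.log(sum(math.exp(lam * (gi - gbar)) for gi in g) / len(g))

x = [-1.0, -0.5, 0.0, 0.3, 0.8, 1.5, 2.0]   # made-up sample
theta = 0.2                                 # trial value, not the estimate
n = len(x)
g = [xi - theta for xi in x]
G = -1.0                                    # dg/dtheta for g = x - theta

lam = et_lambda(g)
tau_i = [math.exp(lam * gi) for gi in g]
tau = sum(tau_i) / n
gbar = sum(g) / n
kappa = -gbar / (sum(ti * gi * gi for ti, gi in zip(tau_i, g)) / (n * tau))

# Sample averages of the four blocks of the moment vector.
block1 = sum(ti - tau for ti in tau_i) / n
block2 = sum(ti * gi for ti, gi in zip(tau_i, g)) / n
block3 = sum((tau - ti) * gi + ti * gi * gi * kappa for ti, gi in zip(tau_i, g)) / n
block4 = sum(ti * G * kappa + ti * G * lam * gi * kappa - ti * G * lam + tau * G * lam
             for ti, gi in zip(tau_i, g)) / n

# Central finite difference of the ETEL objective in theta.
h = 1e-5
dlogL = (log_etel(x, theta + h) - log_etel(x, theta - h)) / (2 * h)

print(block1, block2, block3, block4 / tau, dlogL)
```

Because θ is held at a trial value rather than at the estimate, `block4` is not zero here; it vanishes only at the ETEL estimate, which in this just-identified example is the sample mean.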

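Proof of Theorem 10. We first establish consistency of $\hat\beta$ in three
steps: (i) Show that $\hat\lambda(\theta) \to_p \lambda^*(\theta)$ uniformly for $\theta \in \Theta$. (ii) Show that $\hat\theta \to_p \theta^*$
and therefore that $\hat\lambda(\hat\theta) \to_p \lambda^*(\theta^*)$. (iii) Show that this implies $\hat\tau \to_p \tau^*$ and
$\hat\kappa \to_p \kappa^*$.

Step 1. By Lemma 2.4 in [46], continuity of $\exp(\lambda' g(x_i,\theta))$ in $\lambda$ and $\theta$,
parts 1 and 4 of Assumption 3 imply that $\hat M_\theta(\lambda) \equiv n^{-1}\sum_i \exp(\lambda' g(x_i,\theta)) \to_p
M_\theta(\lambda) \equiv E[\exp(\lambda' g(x_i,\theta))]$ uniformly over the compact set $\{(\lambda', \theta')' : \lambda \in
\Lambda(\theta), \theta \in \Theta\}$, where $\Lambda(\theta)$ is as in part 4 of Assumption 3. We can then
show that for any $\eta > 0$, $P[\sup_{\theta\in\Theta}\|\bar\lambda(\theta) - \lambda^*(\theta)\| < \eta] \to 1$, where $\bar\lambda(\theta) =
\arg\min_{\lambda\in\Lambda(\theta)}\hat M_\theta(\lambda)$, as follows. For a given $\eta > 0$, select $\varepsilon = \inf_{\theta\in\Theta}
\inf_{\lambda\in\Lambda(\theta):\|\lambda-\lambda^*(\theta)\|\ge\eta}(M_\theta(\lambda) - M_\theta(\lambda^*(\theta)))$, which is nonzero by the strict
convexity of $M_\theta(\lambda)$ in $\lambda$ and the fact that $\Theta$ is compact. By the definition of $\varepsilon$,
whenever $\sup_\theta (M_\theta(\bar\lambda(\theta)) - M_\theta(\lambda^*(\theta))) < \varepsilon$, then $\sup_{\theta\in\Theta}\|\bar\lambda(\theta) - \lambda^*(\theta)\| < \eta$.
However, using the fact that $(\hat M_\theta(\bar\lambda(\theta)) - \hat M_\theta(\lambda^*(\theta))) \le 0$, we have

\[
\begin{aligned}
\sup_\theta (M_\theta(\bar\lambda(\theta)) - M_\theta(\lambda^*(\theta)))
&\le \sup_\theta (M_\theta(\bar\lambda(\theta)) - \hat M_\theta(\bar\lambda(\theta))) + \sup_\theta (\hat M_\theta(\bar\lambda(\theta)) - \hat M_\theta(\lambda^*(\theta))) \\
&\qquad + \sup_\theta (\hat M_\theta(\lambda^*(\theta)) - M_\theta(\lambda^*(\theta))) \\
&\le \sup_\theta |M_\theta(\bar\lambda(\theta)) - \hat M_\theta(\bar\lambda(\theta))| + \sup_\theta |\hat M_\theta(\lambda^*(\theta)) - M_\theta(\lambda^*(\theta))| \\
&< \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon
\end{aligned}
\]

w.p.a. 1. Hence, $\sup_{\theta\in\Theta}\|\bar\lambda(\theta) - \lambda^*(\theta)\| < \eta$ w.p.a. 1. In order to obtain the
same conclusion for $\hat\lambda(\theta)$ rather than $\bar\lambda(\theta)$, we employ an argument similar
to the proof of Theorem 2.7 in [46]. Since $\hat M_\theta(\lambda)$ is convex in $\lambda$ for any
$\theta$, if the minimum $\bar\lambda(\theta)$ lies in the interior of $\Lambda(\theta)$, no other points in the
complement of $\Lambda(\theta)$ can achieve a lower value and thus minimizing $\hat M_\theta(\lambda)$
over $\Lambda(\theta)$ or $\mathbb{R}^{N_g}$ yields the same answer asymptotically. This establishes
$\sup_{\theta\in\Theta}\|\hat\lambda(\theta) - \lambda^*(\theta)\| \to_p 0$.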
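
The convergence of the sample minimizer to the population minimizer can be illustrated in a case where the latter is known in closed form. The sketch below is only an illustration under assumptions not made in the paper: x is drawn from N(μ, 1), the moment function is g(x, θ) = x − θ, so that $M_\theta(\lambda) = \exp(\lambda(\mu - \theta) + \lambda^2/2)$ is minimized at $\lambda^*(\theta) = \theta - \mu$, and the minimization is carried out over the whole real line rather than over a compact set $\Lambda(\theta)$. The seed and sample size are arbitrary.

```python
import math
import random

def et_lambda(g, iters=50):
    """Newton iterations for the minimizer of n^{-1} sum_i exp(lam*g_i),
    i.e. the scalar root of sum_i exp(lam*g_i)*g_i = 0."""
    lam = 0.0
    for _ in range(iters):
        e = [math.exp(lam * gi) for gi in g]
        f = sum(ei * gi for ei, gi in zip(e, g))
        df = sum(ei * gi * gi for ei, gi in zip(e, g))
        lam -= f / df
    return lam

random.seed(12345)                 # arbitrary seed, for reproducibility only
mu, theta, n = 0.0, 0.5, 4000      # assumed design; theta != mu mimics misspecification

g = [random.gauss(mu, 1.0) - theta for _ in range(n)]

lam_bar = et_lambda(g)             # sample minimizer of M_hat_theta(lambda)
lam_star = theta - mu              # population minimizer under the normal design

print(lam_bar, lam_star)
```

Since the sample objective is a convex function of λ, the Newton iteration started at zero finds its unique minimizer, which should lie close to `lam_star` at this sample size.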

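Step 2. $\ln\hat L(\theta) = -\ln(n^{-1}\sum_i \exp(\hat\lambda'(\theta)g(x_i,\theta))) + \hat\lambda'(\theta)\bar g(\theta) \to_p \ln L(\theta)$ uniformly
for $\theta \in \Theta$, because (i) $\sup_{\theta\in\Theta}\|\hat\lambda(\theta) - \lambda^*(\theta)\| \to_p 0$; (ii) $\sup_{\theta\in\Theta}\|\bar g(\theta) -
E[g(x_i,\theta)]\| \to_p 0$ since $g(x_i,\theta)$ is continuous in $\theta$ and $E[\sup_{\theta\in\Theta}\|g(x_i,\theta)\|] <
\infty$ by part 4 of Assumption 3 and by the inequality $|s| \le \exp(-s) + \exp(s)$ for
any $s \in \mathbb{R}$; and (iii) $\exp(\lambda' g(x_i,\theta))$ is continuous in $\theta$ and $E[\sup_{\theta\in\Theta}\sup_{\lambda\in\Lambda(\theta)}
\exp(\lambda' g(x_i,\theta))] < \infty$ by part 4 of Assumption 3 (using Lemma 2.4 in [46]).
Since $\ln L(\theta)$ is uniquely maximized at $\theta^*$, this implies, along with the uniform
convergence of $\ln\hat L(\theta)$ and its continuity, that $\hat\theta \to_p \theta^*$. Since $\sup_{\theta\in\Theta}\|\hat\lambda(\theta) -
\lambda^*(\theta)\| \to_p 0$ we also have that $\hat\lambda(\hat\theta) \to_p \lambda^*(\theta^*)$.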

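Step 3. As we have shown that $\hat\theta \to_p \theta^*$ and $\hat\lambda \to_p \lambda^*$ and since $\hat\tau$ and $\hat\kappa$ can
be written as explicit continuous functions of $\hat\lambda$ and $\hat\theta$, by (54) and (56), it
follows that $\hat\tau \to_p E[\tau_i] \equiv \tau^*$ and $\hat\kappa \to_p -(E[\tau_i g_i(x_i,\theta^*)g_i'(x_i,\theta^*)])^{-1}(\tau^* E[g_i(x_i,
\theta^*)]) \equiv \kappa^*$, where the fact that $E[\tau_i g_i(x_i,\theta^*)g_i'(x_i,\theta^*)]$ is invertible is implied
by the assumption that $\Gamma$ is nonsingular.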

Having established that $\hat\beta \to_p \beta^*$, we now turn to asymptotic normality.
Since Lemma 9 defines a just-identified GMM estimator, we can use Theorem
3.4 in [46], specialized to the just-identified case, if we can show that

(i) $E[\sup_{\beta\in B}\|\partial\phi(x_i,\beta)/\partial\beta'\|] < \infty$ for some neighborhood $B$ of $\beta^*$ and that

(ii) $E[\phi(x_i,\beta^*)\phi'(x_i,\beta^*)]$ exists.

The matrix $\partial\phi(x_i,\beta)/\partial\beta'$ consists of terms of the form $\alpha\exp(k_\tau\lambda' g_i)g^{k_g}G^{k_G}
S^{k_S}$ for $0 \le k_g + k_G + k_S \le 3$ and $k_\tau = 0, 1$, and where $g$, $G$, and $S$, respectively,
denote elements of $g_i$, $G_i$, and $S_{jl}(x_i,\theta)$ and where $\alpha$ denotes products
of elements of $\beta$ that are necessarily bounded for $\beta \in B$. By part 6 of Assumption
3, we can establish (i): $\exp(k_\tau\lambda' g_i)|g|^{k_g}|G|^{k_G}|S|^{k_S} \le \exp(k_\tau\lambda' g_i)
|b(x_i)|^{k_g+k_G+k_S} \Rightarrow E[\sup_{\beta\in B}\exp(k_\tau\lambda' g_i)|g|^{k_g}|G|^{k_G}|S|^{k_S}] \le E[\sup_{\beta\in B}
\exp(k_\tau\lambda' g(x_i,\theta))(b(x_i))^{k_2}] = E[\sup_{\theta\in\mathcal N}\sup_{\lambda\in\Lambda(\theta)}\exp(k_\tau\lambda' g(x_i,\theta))(b(x_i))^{k_2}] <
\infty$. The matrix $\phi(x_i,\beta^*)\phi'(x_i,\beta^*)$ has elements of the form $\alpha\exp(k_\tau\lambda' g_i)|g|^{k_g}|G|^{k_G}$
with $k_\tau = 0, 1, 2$ and $0 \le k_g + k_G \le 4$ and similar reasoning implies (ii). □